Abstract


This  survey  covers  rollback-recovery  techniques  that  do  not  require  special  language  constructs. 
In  the  first  part  of  the  survey  we  classify  rollback-recovery  protocols  into  checkpoint-based  and 
log-based.  Checkpoint-based  protocols  rely  solely  on  checkpointing  for  system  state  restoration. 
Checkpointing  can  be  coordinated,  uncoordinated,  or  communication-induced.  Log-based 
protocols  combine  checkpointing  with  logging  of  nondeterministic  events,  encoded  in  tuples 
called  determinants.  Depending  on  how  determinants  are  logged,  log-based  protocols  can  be 
pessimistic,  optimistic,  or  causal.  Throughout  the  survey,  we  highlight  the  research  issues  that  are 
at  the  core  of  rollback  recovery  and  present  the  solutions  that  currently  address  them.  We  also 
compare  the  performance  of  different  rollback-recovery  protocols  with  respect  to  a  series  of 
desirable  properties  and  discuss  the  issues  that  arise  in  the  practical  implementations  of  these 
protocols. 


1  Introduction 


Distributed  systems  today  are  ubiquitous  and  enable  many  applications,  including  client-server  systems,  transaction 
processing,  World  Wide  Web,  and  scientific  computing,  among  many  others.  The  vast  computing  potential  of  these 
systems  is  often  hampered  by  their  susceptibility  to  failures.  Therefore,  many  techniques  have  been  developed  to 
add  reliability  and  high  availability  to  distributed  systems.  These  techniques  include  transactions,  group 
communications  and  rollback  recovery,  and  have  different  tradeoffs  and  focuses.  For  example,  transactions  focus 
on  data-oriented  applications,  while  group  communications  offer  an  abstraction  of  an  ideal  communication  system 
on  top  of  which  programmers  can  develop  reliable  applications.  This  survey  covers  transparent  rollback  recovery, 
which  focuses  on  long-running  applications  such  as  scientific  computing  and  telecommunication  applications 
[26][43]. 

Rollback  recovery  treats  a  distributed  system  as  a  collection  of  application  processes  that  communicate 
through  a  network.  Fault  tolerance  is  achieved  by  periodically  using  stable  storage  to  save  the  processes’  states 
during  failure-free  execution.  Upon  a  failure,  a  failed  process  restarts  from  one  of  its  saved  states,  thereby  reducing 
the  amount  of  lost  computation.  Each  of  the  saved  states  is  called  a  checkpoint. 

Rollback  recovery  has  many  flavors.  For  example,  a  system  may  rely  on  the  application  to  decide  when  and 
what  to  save  on  stable  storage.  Or,  it  may  provide  the  application  programmer  with  linguistic  constructs  to  structure 
the  application  [47].  We  focus  in  this  survey  on  transparent  techniques,  which  do  not  require  any  intervention  on 
the  part  of  the  application  or  the  programmer.  The  system  automatically  takes  checkpoints  according  to  some 
specified  policy,  and  recovers  automatically  from  failures  if  they  occur.  This  approach  has  the  advantages  of 
relieving  the  application  programmers  from  the  complex  and  error-prone  chores  of  implementing  fault  tolerance  and 
of  offering  fault  tolerance  to  existing  applications  written  without  consideration  to  reliability  concerns. 

Rollback  recovery  has  been  studied  in  various  forms  and  in  connection  with  many  fields  of  research.  Thus,  it 
is  perhaps  impossible  to  provide  an  extensive  coverage  of  all  the  issues  related  to  rollback  recovery  within  the  scope 
of  one  article.  This  survey  concentrates  on  the  definitions,  fundamental  concepts,  and  implementation  issues  of 
rollback-recovery  protocols  in  distributed  systems.  The  coverage  excludes  the  use  of  rollback  recovery  in  many 
related fields such as hardware-level instruction retry, distributed shared memory [38], real-time systems, and
debugging  [37].  The  coverage  also  excludes  the  issues  of  using  rollback  recovery  when  failures  could  include 
Byzantine  modes  or  are  not  restricted  to  the  fail-stop  model  [51].  Also  excluded  are  rollback-recovery  techniques 
that  rely  on  special  language  constructs  such  as  recovery  blocks  [47]  and  transactions.  Finally,  the  section  on 
implementation  exposes  many  relevant  issues  related  to  implementing  checkpointing  on  uniprocessors,  although  the 
coverage  is  by  no  means  an  exhaustive  one  because  of  the  large  number  of  issues  involved. 

Message-passing  systems  complicate  rollback  recovery  because  messages  induce  inter-process  dependencies 
during  failure-free  operation.  Upon  a  failure  of  one  or  more  processes  in  a  system,  these  dependencies  may  force 
some  of  the  processes  that  did  not  fail  to  roll  back,  creating  what  is  commonly  called  rollback  propagation.  To  see 
why  rollback  propagation  occurs,  consider  the  situation  where  a  sender  of  a  message  m  rolls  back  to  a  state  that 
precedes the sending of m. The receiver of m must also roll back to a state that precedes m's receipt; otherwise, the
states  of  the  two  processes  would  be  inconsistent  because  they  would  show  that  message  m  was  received  without 
being  sent,  which  is  impossible  in  any  correct  failure-free  execution.  Under  some  scenarios,  rollback  propagation 
may  extend  back  to  the  initial  state  of  the  computation,  losing  all  the  work  performed  before  a  failure.  This  situation 
is  known  as  the  domino  effect  [47]. 

The  domino  effect  may  occur  if  each  process  takes  its  checkpoints  independently — an  approach  known  as 
independent  or  uncoordinated  checkpointing.  It  is  obviously  desirable  to  avoid  the  domino  effect  and  therefore 
several  techniques  have  been  developed  to  prevent  it.  One  such  technique  is  to  perform  coordinated  checkpointing 
in which processes coordinate their checkpoints in order to save a system-wide consistent state [16]. This consistent
set  of  checkpoints  can  then  be  used  to  bound  rollback  propagation.  Alternatively,  communication-induced 
checkpointing  forces  each  process  to  take  checkpoints  based  on  information  piggybacked  on  the  application 
messages it receives from other processes [50]. Checkpoints are taken such that a system-wide consistent state
always  exists  on  stable  storage,  thereby  avoiding  the  domino  effect. 

The  approaches  discussed  so  far  implement  checkpoint-based  rollback  recovery,  which  relies  only  on 
checkpoints  to  achieve  fault-tolerance.  In  contrast,  log-based  rollback  recovery  combines  checkpointing  with 
logging of nondeterministic events.¹ Log-based rollback recovery relies on the piecewise deterministic (PWD)
assumption  [56],  which  postulates  that  all  nondeterministic  events  that  a  process  executes  can  be  identified  and  that 
the  information  necessary  to  replay  each  event  during  recovery  can  be  logged  in  the  event’s  determinant  [1].  By 
logging  and  replaying  the  nondeterministic  events  in  their  exact  original  order,  a  process  can  deterministically 
recreate  its  pre-failure  state  even  if  this  state  has  not  been  checkpointed.  Log-based  rollback  recovery  in  general 
enables  a  system  to  recover  beyond  the  most  recent  set  of  consistent  checkpoints.  It  is  therefore  particularly 
attractive  for  applications  that  frequently  interact  with  the  outside  world,  which  consists  of  all  input  and  output 
devices  that  cannot  roll  back.  Log-based  rollback  recovery  has  three  flavors,  depending  on  how  the  determinants  are 
logged  to  stable  storage.  In  pessimistic  logging,  the  application  has  to  block  waiting  for  the  determinant  of  each 
nondeterministic  event  to  be  stored  on  stable  storage  before  the  effects  of  that  event  can  be  seen  by  other  processes 
or  the  outside  world.  Pessimistic  logging  simplifies  recovery  but  hurts  failure-free  performance.  In  optimistic 
logging,  the  application  does  not  block,  and  determinants  are  spooled  to  stable  storage  asynchronously.  Optimistic 
logging  reduces  the  failure-free  overhead,  but  complicates  recovery.  Finally,  in  causal  logging,  low  failure-free 
overhead  and  simpler  recovery  are  combined  by  striking  a  balance  between  optimistic  and  pessimistic  logging.  The 
three  flavors  also  differ  in  their  requirements  for  garbage  collection  and  their  interactions  with  the  outside  world,  as 
will  be  explained  later. 

The  outline  of  the  rest  of  the  survey  is  as  follows: 

•  Section  2:  system  model,  terminology  and  generic  issues  in  rollback  recovery. 

•  Section  3:  checkpoint-based  rollback-recovery  protocols. 

•  Section  4:  log-based  rollback-recovery  protocols. 

•  Section  5:  implementation  issues. 

•  Section  6:  conclusions. 

2  Background  and  Definitions 

2.1  System  Model 

A  message-passing  system  consists  of  a  fixed  number  of  processes  that  communicate  only  through  messages. 
Throughout  this  survey,  we  use  N  to  denote  the  total  number  of  processes  in  a  system.  Processes  cooperate  to 
execute  a  distributed  application  program  and  interact  with  the  outside  world  by  receiving  and  sending  input  and 
output  messages,  respectively.  Figure  1  shows  a  sample  system  consisting  of  three  processes,  where  horizontal  lines 
extending  toward  the  right-hand  side  represent  the  execution  of  each  process,  and  arrows  between  processes 
represent  messages. 

Rollback-recovery  protocols  generally  assume  that  the  communication  network  is  immune  to  partitioning  but 
differ  in  the  assumptions  they  make  about  network  reliability.  Some  protocols  assume  that  the  communication 
subsystem  delivers  messages  reliably,  in  First-In-First-Out  (FIFO)  order  [16],  while  other  protocols  assume  that  the 
communication  subsystem  can  lose,  duplicate,  or  reorder  messages  [28].  The  choice  between  these  two  assumptions 
usually  affects  the  complexity  of  recovery  and  its  implementation  in  different  ways.  Generally,  assuming  a  reliable 
network  simplifies  the  design  of  the  recovery  protocol  but  introduces  implementation  complexities  that  will  be 
described  in  Section  2.4  and  Section  5.4.2. 

A  process  execution  is  a  sequence  of  state  intervals,  each  started  by  a  nondeterministic  event.  Execution 
during  each  state  interval  is  deterministic,  such  that  if  a  process  starts  from  the  same  state  and  is  subjected  to  the 
same  nondeterministic  events  at  the  same  locations  within  the  execution,  it  will  always  yield  the  same  output.  A 
process may fail, in which case it loses its volatile state and stops execution according to the fail-stop model [51].
Processes  have  access  to  a  stable  storage  device  that  survives  failures,  such  that  state  information  saved  on  this 
device  during  failure-free  execution  can  be  used  for  recovery.  The  number  of  tolerated  process  failures  may  vary 
from 1 to N, and the recovery protocol needs to be designed accordingly. Furthermore, some protocols may not
tolerate  failures  that  occur  during  recovery. 


¹ Earlier papers in this area have assumed a model in which the occurrence of a nondeterministic event is modeled as a message
receipt.  In  this  model,  nondeterministic-event  logging  reduces  to  message  logging.  In  this  paper,  we  use  the  terms  event  logging 
and  message  logging  interchangeably. 
Figure 1. An example of a message-passing system with three processes.


2.2  Consistent  System  States 

A  global  state  of  a  message-passing  system  is  a  collection  of  the  individual  states  of  all  participating  processes  and 
of  the  states  of  the  communication  channels.  Intuitively,  a  consistent  global  state  is  one  that  may  occur  during  a 
failure-free,  correct  execution  of  a  distributed  computation.  More  precisely,  a  consistent  system  state  is  one  in 
which  if  a  process’s  state  reflects  a  message  receipt,  then  the  state  of  the  corresponding  sender  reflects  sending  that 
message [16]. For example, Figure 2 shows two examples of global states: a consistent state in Figure 2(a), and an
inconsistent state in Figure 2(b). Note that the consistent state in Figure 2(a) shows message m1 to have been sent
but not yet received. This state is consistent, because it represents a situation in which the message has left the
sender and is still traveling across the network. On the other hand, the state in Figure 2(b) is inconsistent because
process P2 is shown to have received m2 but the state of process P1 does not reflect sending it. Such a state is
impossible  in  any  failure-free,  correct  computation. 
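
The definition above can be checked mechanically. The following is a minimal sketch, assuming that each process state is summarized by two hypothetical sets: the identifiers of the messages that the state reflects as sent and as received. The function names and the sample states are illustrative only; the states are loosely modeled on the two situations described for Figure 2 and do not reproduce the figure.

```python
def orphan_messages(sent, received):
    """Return ids of messages reflected as received but not reflected as sent.

    sent[p] and received[p] are hypothetical summaries of the state of
    process p: the sets of message ids that state records as sent/received.
    """
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    return all_received - all_sent


def is_consistent(sent, received):
    """A global state is consistent if every reflected receipt has a reflected send."""
    return not orphan_messages(sent, received)


# In the spirit of Figure 2(a): m1 has been sent but not yet received.
sent_a = {"P0": {"m1"}, "P1": {"m2"}, "P2": set()}
recv_a = {"P0": set(), "P1": set(), "P2": {"m2"}}

# In the spirit of Figure 2(b): P2 reflects receiving m2, P1 does not reflect sending it.
sent_b = {"P0": {"m1"}, "P1": set(), "P2": set()}
recv_b = {"P0": set(), "P1": {"m1"}, "P2": {"m2"}}

print(is_consistent(sent_a, recv_a))    # True
print(is_consistent(sent_b, recv_b))    # False
print(orphan_messages(sent_b, recv_b))  # {'m2'}
```

Note that a message that is sent but not received (m1 in the first state) does not affect the result, which matches the observation that such a message is merely still traveling across the network.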

Inconsistent  states  occur  because  of  failures.  For  example,  the  situation  shown  in  part  (b)  of  Figure  2  may 
occur if process P1 fails after sending message m2 to P2 and then restarts at the state shown in the figure.

A  fundamental  goal  of  any  rollback-recovery  protocol  is  to  bring  the  system  into  a  consistent  state  when 
inconsistencies  occur  because  of  a  failure.  The  reconstructed  consistent  state  is  not  necessarily  one  that  has  occurred 
before  the  failure.  It  is  sufficient  that  the  reconstructed  state  be  one  that  could  have  occurred  before  the  failure  in  a 
failure-free,  correct  execution. 


Figure  2.  An  example  of  a  consistent  and  inconsistent  state. 
2.3 Z-Paths and Z-Cycles

A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints [41]. Let → denote
Lamport's happen-before relation [34]. Let C_{i,x} denote the x-th checkpoint of process P_i. Also, define the execution
portion between two consecutive checkpoints on the same process to be the checkpoint interval starting with the
earlier checkpoint. Given two checkpoints C_{i,x} and C_{j,y}, a Z-path exists between C_{i,x} and C_{j,y} if and only if one of the
following two conditions holds:

1. x < y and i = j; or

2. There exists a sequence of messages [m_0, m_1, ..., m_n], n ≥ 0, such that:

• C_{i,x} → send_i(m_0);

• ∀ l < n, either deliver_k(m_l) and send_k(m_{l+1}) are in the same checkpoint interval, or
deliver_k(m_l) → send_k(m_{l+1}); and

• deliver_j(m_n) → C_{j,y}

where send_i and deliver_i are communication events executed by process P_i. In Figure 3, [m1, m2] and [m3, m4] are
examples of Z-paths between checkpoints C_{0,1} and C_{2,2}.

A Z-cycle is a Z-path that begins and ends with the same checkpoint. In Figure 3, the Z-path [m5, m3, m4] is a
Z-cycle that starts and ends at checkpoint C_{2,2}.

The  Z-cycle  theory  was  first  introduced  as  a  framework  for  reasoning  about  consistent  system  states. 
Recently,  the  theory  has  proved  a  powerful  tool  for  reasoning  about  a  class  of  protocols  known  as  communication- 
induced  checkpointing  [5]  [24].  In  particular,  it  has  been  proven  that  a  checkpoint  involved  in  a  Z-cycle  cannot 
become  part  of  a  consistent  state  in  a  system  that  uses  only  checkpoints. 
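
The Z-path definition can be turned into a small search. The sketch below assumes a hypothetical encoding in which every message records the checkpoint interval in which it was sent and the one in which it was delivered, with an interval numbered by the index of the checkpoint that starts it. Because the events of one process are totally ordered, the condition linking consecutive messages then reduces to "the next message is sent in the same or a later interval than the one in which the previous message was delivered". The names Message, has_zpath and in_zcycle, and the two-process execution, are illustrative and do not correspond to Figure 3.

```python
from collections import deque, namedtuple

# send_iv / recv_iv: index of the checkpoint that starts the interval in which
# the message was sent / delivered (an assumed numbering, not from the survey).
Message = namedtuple("Message", "name sender send_iv receiver recv_iv")


def has_zpath(messages, i, x, j, y):
    """Return True if a Z-path exists from checkpoint (i, x) to checkpoint (j, y)."""
    if i == j and x < y:                       # condition 1
        return True
    # Condition 2, first item: the first message is sent by P_i after its x-th checkpoint.
    frontier = deque(m for m in messages if m.sender == i and m.send_iv >= x)
    seen = {m.name for m in frontier}
    while frontier:
        m = frontier.popleft()
        # Last item: the final message is delivered to P_j before its y-th checkpoint.
        if m.receiver == j and m.recv_iv < y:
            return True
        # Middle item: the next message leaves the receiver of m in the same
        # interval as the delivery of m, or in a later one.
        for nxt in messages:
            if (nxt.name not in seen and nxt.sender == m.receiver
                    and nxt.send_iv >= m.recv_iv):
                seen.add(nxt.name)
                frontier.append(nxt)
    return False


def in_zcycle(messages, i, x):
    """A checkpoint is in a Z-cycle if a message sequence leads from it back to itself."""
    return has_zpath(messages, i, x, i, x)


# Hypothetical execution: P0 sends b, P1 delivers b, P1 takes its checkpoint 1,
# P1 sends a, and P0 delivers a while still in its interval 0.
msgs = [
    Message("a", sender=1, send_iv=1, receiver=0, recv_iv=0),
    Message("b", sender=0, send_iv=0, receiver=1, recv_iv=0),
]
print(in_zcycle(msgs, 1, 1))   # True: [a, b] leaves checkpoint (1, 1) and returns to it
print(in_zcycle(msgs, 0, 0))   # False
```

In the example, b is sent before a is delivered, so [a, b] is not a causal chain; it qualifies only because both events fall in the same checkpoint interval of P0, which is what distinguishes a zigzag path from a causal path.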

2.4  In-Transit  Messages 

In Figure 2(a), the global state shows that message m1 has been sent but not yet received. We call such a message an
in-transit  message.  When  in-transit  messages  are  part  of  a  global  system  state,  these  messages  do  not  cause  any 
inconsistency.  However,  depending  on  whether  the  system  model  assumes  reliable  communication  channels, 
rollback-recovery  protocols  may  have  to  guarantee  the  delivery  of  in-transit  messages  when  failures  occur.  For 
example,  the  rollback-recovery  protocol  in  Figure  4(a)  assumes  reliable  communications,  and  therefore  it  must  be 
implemented  on  top  of  a  reliable  communication  protocol  layer.  In  contrast,  the  rollback-recovery  protocol  in 
Figure  4(b)  does  not  assume  reliable  communications. 

Reliable  communication  protocols  ensure  the  reliability  of  message  delivery  during  failure-free  executions. 
They  cannot,  however,  ensure  by  themselves  the  reliability  of  message  delivery  in  the  presence  of  process  failures. 
For  instance,  if  an  in-transit  message  is  lost  because  the  intended  receiver  has  failed,  conventional  communication 
protocols  will  generate  a  timeout  and  inform  the  sender  that  the  message  cannot  be  delivered.  In  a  rollback-recovery 
system,  however,  the  receiver  will  eventually  recover.  Therefore,  the  system  must  mask  the  timeout  from  the 
application  program  at  the  sender  process  and  must  make  in-transit  messages  available  to  the  intended  receiver 
process after it recovers, in order to ensure a consistent view of the reliable system. On the other hand, if a system
model assumes unreliable communication channels, as in Figure 4(b), then the recovery protocol need not handle
in-transit messages in any special way. Indeed, in-transit messages lost because of process failures cannot be
distinguished from those lost because of communication failures in an unreliable communication channel. Therefore, the
loss of in-transit messages due to either communication or process failure is an event that can occur in any
failure-free, correct execution of the system.

Figure 3. An example execution and Z-paths.

Figure 4. Implementation of rollback-recovery (a) on top of a reliable communication protocol;
(b) directly on top of unreliable communication channels.

2.5  Checkpointing  Protocols  and  the  Domino  Effect 

In  checkpointing  protocols,  each  process  periodically  saves  its  state  on  stable  storage.  The  saved  state  contains 
sufficient  information  to  restart  process  execution.  A  consistent  global  checkpoint  is  a  set  of  N  local  checkpoints, 
one  from  each  process,  forming  a  consistent  system  state.  Any  consistent  global  checkpoint  can  be  used  to  restart 
process  execution  upon  a  failure.  Nevertheless,  it  is  desirable  to  minimize  the  amount  of  lost  work  by  restoring  the 
system  to  the  most  recent  consistent  global  checkpoint,  which  is  called  the  recovery  line  [47]. 

Processes  may  coordinate  their  checkpoints  to  form  consistent  states,  or  may  take  checkpoints  independently 
and  search  for  a  consistent  state  during  recovery  out  of  the  set  of  saved  individual  checkpoints.  The  second  style, 
however, can lead to the domino effect [47]. For example, Figure 5 shows an execution in which processes take
their checkpoints (represented by black bars) without coordinating with each other. Each process starts its
execution with an initial checkpoint. Suppose process P2 fails and rolls back to checkpoint C. The rollback
"invalidates" the sending of a message to P1, and so P1 must roll back to checkpoint B to "invalidate" the receipt of that
message. Thus, the invalidation of that message propagates the rollback of process P2 to process P1, which in turn
"invalidates" message m7 and forces P0 to roll back as well.




Figure  5.  Rollback  propagation,  recovery  line  and  the  domino  effect. 
This cascaded rollback may continue and eventually may lead to the domino effect, which causes the system
to  roll  back  to  the  beginning  of  the  computation,  in  spite  of  all  the  saved  checkpoints.  In  the  example  shown  in 
Figure  5,  cascading  rollbacks  due  to  the  single  failure  of  process  P2  may  result  in  a  recovery  line  that  consists  of  the 
initial  set  of  checkpoints,  effectively  causing  the  loss  of  all  the  work  done  by  all  processes.  To  avoid  the  domino 
effect,  processes  need  either  to  coordinate  their  checkpoints  so  that  the  recovery  line  is  advanced  as  new  checkpoints 
are  taken,  or  to  combine  checkpointing  with  event  logging. 
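
Rollback propagation can be illustrated with a small fixed-point computation. This is a minimal sketch under stated assumptions, not the algorithm of any particular protocol: each message is described by the checkpoint intervals in which it was sent and received (an interval is numbered by the checkpoint that starts it), and restarting a process from checkpoint k is assumed to undo every event in interval k or later. The names Message and recovery_line and the two-process execution are hypothetical and do not reproduce Figure 5.

```python
from collections import namedtuple

# send_iv / recv_iv: index of the checkpoint that starts the interval in which
# the message was sent / received (hypothetical encoding of an execution).
Message = namedtuple("Message", "name sender send_iv receiver recv_iv")


def recovery_line(messages, start):
    """Propagate rollbacks until no message is received without being sent.

    start[p] is the checkpoint index from which process p initially restarts.
    A surviving process may use one past its last checkpoint index to stand
    for its current (volatile) state.
    """
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_undone = m.send_iv >= line[m.sender]
            receipt_kept = m.recv_iv < line[m.receiver]
            if send_undone and receipt_kept:
                # Roll the receiver back to the checkpoint that precedes the receipt.
                line[m.receiver] = m.recv_iv
                changed = True
    return line


# Hypothetical staggered execution of two processes with checkpoints 0, 1, 2 each.
# P1 fails and restarts from its checkpoint 2; P0 survives (current state = 3).
msgs = [
    Message("m1", sender=1, send_iv=0, receiver=0, recv_iv=0),
    Message("m2", sender=0, send_iv=1, receiver=1, recv_iv=0),
    Message("m3", sender=1, send_iv=1, receiver=0, recv_iv=1),
    Message("m4", sender=0, send_iv=2, receiver=1, recv_iv=1),
    Message("m5", sender=1, send_iv=2, receiver=0, recv_iv=2),
]
print(recovery_line(msgs, {0: 3, 1: 2}))        # {0: 0, 1: 0}

# Without m4 the cascade stops after one step.
no_m4 = [m for m in msgs if m.name != "m4"]
print(recovery_line(no_m4, {0: 3, 1: 2}))       # {0: 2, 1: 2}
```

In the first run every rollback invalidates a message whose receipt lies just before the other process's previous checkpoint, so the recovery line collapses to the initial checkpoints; this is the domino effect in miniature.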

2.6  Interactions  with  the  Outside  World 

A message-passing system often interacts with the outside world to receive input data or show the outcome of a
computation.  If  a  failure  occurs,  the  outside  world  cannot  be  relied  on  to  roll  back  [42].  For  example,  a  printer 
cannot  roll  back  the  effects  of  printing  a  character,  and  an  automatic  teller  machine  cannot  recover  the  money  that  it 
dispensed  to  a  customer.  It  is  therefore  necessary  that  the  outside  world  perceive  a  consistent  behavior  of  the  system 
despite  failures.  Thus,  before  sending  output  to  the  outside  world,  the  system  must  ensure  that  the  state  from  which 
the  output  is  sent  will  be  recovered  despite  any  future  failure.  This  is  commonly  called  the  output  commit  problem 
[56].  Similarly,  input  messages  that  a  system  receives  from  the  outside  world  may  not  be  reproducible  during 
recovery,  because  it  may  not  be  possible  for  the  outside  world  to  regenerate  them.  Thus,  recovery  protocols  must 
arrange  to  save  these  input  messages  so  that  they  can  be  retrieved  when  needed  for  execution  replay  after  a  failure. 
A  common  approach  is  to  save  each  input  message  on  stable  storage  before  allowing  the  application  program  to 
process  it. 
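
The input-logging approach can be sketched as follows. The example assumes that an append-only local file stands in for stable storage and that input messages are JSON-serializable; the class InputLog and its methods are hypothetical names introduced only for illustration.

```python
import json
import os
import tempfile


class InputLog:
    """Hypothetical stand-in for stable storage: an append-only file."""

    def __init__(self, path):
        self.path = path

    def deliver(self, message, handler):
        """Force the input message to disk, then let the application process it."""
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(message) + "\n")
            log.flush()
            os.fsync(log.fileno())   # ask the OS to push the data to the device
        return handler(message)

    def replay(self, handler):
        """Feed the logged inputs to the application again, in their original order."""
        with open(self.path, encoding="utf-8") as log:
            return [handler(json.loads(line)) for line in log]


path = os.path.join(tempfile.mkdtemp(), "inputs.log")
log = InputLog(path)

processed = []
log.deliver({"op": "deposit", "amount": 5}, processed.append)
log.deliver({"op": "withdraw", "amount": 2}, processed.append)

replayed = []
log.replay(replayed.append)
print(replayed == processed)   # True
```

The synchronous write before the handler runs is what makes the input reproducible during replay, and it is also the source of the input latency discussed in this section.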

Rollback-recovery  protocols,  therefore,  must  provide  special  treatment  for  interactions  with  the  outside 
world.  There  are  two  metrics  that  express  the  impact  of  this  special  treatment,  namely  the  latency  of  input/output 
and the resulting slowdown of the system's execution during input/output. The first metric represents the time it takes
for  an  output  message  to  be  released  to  the  outside  world  after  it  has  been  issued  by  the  system,  or  the  time  it  takes  a 
process  to  consume  an  input  message  after  it  has  been  posted  to  the  system.  The  second  metric  represents  the 
overhead  that  the  system  incurs  to  ensure  that  input  and  output  actions  will  have  a  persistent  effect  despite  future 
failures. 

2.7  Logging  Protocols 

Log-based  rollback  recovery  uses  checkpointing  and  logging  to  enable  processes  to  replay  their  execution  after  a 
failure  beyond  the  most  recent  checkpoint.  This  is  useful  when  interactions  with  the  outside  world  are  frequent, 
since  it  enables  a  process  to  repeat  its  execution  and  be  consistent  with  output  sent  to  the  outside  world  without 
having  to  take  expensive  checkpoints  before  sending  such  output.  Additionally,  log-based  recovery  generally  is  not 
susceptible  to  the  domino  effect,  thereby  allowing  processes  to  use  uncoordinated  checkpointing  if  desired. 

Log-based  recovery  relies  on  the  piecewise  deterministic  (PWD)  assumption  [56].  Under  this  assumption, 
the rollback-recovery protocol can identify all the nondeterministic events executed by each process and, for each
such event, log a determinant that contains all information necessary to replay the event should it be necessary
during  recovery.  If  the  PWD  assumption  holds,  log-based  rollback-recovery  protocols  can  recover  a  failed  process 
and  replay  its  execution  as  it  occurred  before  the  failure. 
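
The following minimal sketch illustrates the idea, assuming that message deliveries are the only nondeterministic events and that a determinant records the sender, the payload, and the position of the delivery in the receive order. The classes Determinant and Process, the function recover, and the state-update rule are hypothetical and chosen only so that the final state depends on the delivery order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Determinant:
    """Hypothetical determinant of a message delivery."""
    rsn: int       # receive sequence number: position in the delivery order
    sender: int
    payload: int


class Process:
    def __init__(self, state=0, rsn=0):
        self.state = state
        self.rsn = rsn
        self.log = []          # stands in for determinants kept on stable storage

    def _apply(self, sender, payload):
        # Deterministic but order-sensitive state transition.
        self.state = self.state * 31 + (sender + 1) * payload

    def deliver(self, sender, payload):
        """Nondeterministic event: log its determinant, then execute it."""
        self.rsn += 1
        self.log.append(Determinant(self.rsn, sender, payload))
        self._apply(sender, payload)

    def checkpoint(self):
        return (self.state, self.rsn)


def recover(checkpoint, log):
    """Restart from a checkpoint and replay later deliveries in their logged order."""
    state, rsn = checkpoint
    p = Process(state, rsn)
    for d in sorted(log, key=lambda d: d.rsn):
        if d.rsn > rsn:        # deliveries up to the checkpoint are already reflected
            p._apply(d.sender, d.payload)
            p.rsn = d.rsn
    return p


p = Process()
p.deliver(sender=1, payload=7)
ckpt = p.checkpoint()
p.deliver(sender=2, payload=3)
p.deliver(sender=0, payload=9)

q = recover(ckpt, p.log)
print(q.state == p.state)   # True
```

The recovered process reaches a state that was never checkpointed, which is the property that distinguishes log-based from checkpoint-based recovery.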

Examples  of  nondeterministic  events  include  receiving  messages,  receiving  input  from  the  outside  world,  or 
undergoing  an  internal  state  transfer  within  a  process  based  on  some  nondeterministic  action  such  as  the  receipt  of 
an  interrupt.  Rollback-recovery  implementations  differ  in  the  range  of  actual  nondeterministic  events  that  are 
covered  under  this  model.  For  instance,  a  particular  implementation  may  only  cover  message  receipts  from  other 
processes  under  the  PWD  assumption.  Such  an  implementation  cannot  replay  an  execution  that  is  subjected  to  other 
forms  of  nondeterministic  events  such  as  asynchronous  interrupts.  The  range  of  events  covered  under  the  PWD 
assumption  is  an  implementation  issue  and  is  covered  in  Section  5.7. 

A  state  interval  is  recoverable  if  there  is  sufficient  information  to  replay  the  execution  up  to  that  state  interval 
despite  any  future  failures  in  the  system.  Also,  a  state  interval  is  stable  if  the  determinant  of  the  nondeterministic 
event  that  started  it  is  logged  on  stable  storage  [30].  A  recoverable  state  interval  is  always  stable,  but  the  opposite  is 
not  always  true  [28]. 

Figure 6 shows an execution in which the only nondeterministic events are message deliveries. Suppose that
processes P1 and P2 fail before logging the determinants corresponding to the deliveries of m6 and m5, respectively,
while all other determinants survive the failure. Message m7 becomes an orphan message because process P2 cannot
guarantee the regeneration of the same m6 during recovery, and P1 cannot guarantee the regeneration of the same m7
without the original m6. As a result, the surviving process P0 becomes an orphan process and is forced to roll back
as well. States X, Y and Z form the maximum recoverable state [28], i.e., the most recent recoverable consistent
system state. Processes P0 and P2 roll back to checkpoints A and C, respectively, and replay the deliveries of
messages m4 and m2, respectively, to reach states X and Z. Process P1 rolls back to checkpoint B and replays the
deliveries of m1 and m3 in their original order to reach state Y.

Figure 6. Message logging for deterministic replay.

During  recovery,  log-based  rollback-recovery  protocols  force  the  execution  of  the  system  to  be  identical  to 
the  one  that  occurred  before  the  failure,  up  to  the  maximum  recoverable  state.  Therefore,  the  system  always 
recovers  to  a  state  that  is  consistent  with  the  input  and  output  interactions  that  occurred  up  to  the  maximum 
recoverable  state. 

2.8  Stable  Storage 

Rollback  recovery  uses  stable  storage  to  save  checkpoints,  event  logs,  and  other  recovery-related  information. 
Stable  storage  in  rollback  recovery  is  only  an  abstraction,  although  it  is  often  confused  with  the  disk  storage  used  to 
implement  it.  Stable  storage  must  ensure  that  the  recovery  data  persist  through  the  tolerated  failures  and  their 
corresponding  recoveries.  This  requirement  can  lead  to  different  implementation  styles  of  stable  storage: 

•  In  a  system  that  tolerates  only  a  single  failure,  stable  storage  may  consist  of  the  volatile  memory  of  another 
process  [11]  [29]. 

•  In  a  system  that  wishes  to  tolerate  an  arbitrary  number  of  transient  failures,  stable  storage  may  consist  of  a  local 
disk  in  each  host. 

•  In  a  system  that  tolerates  non-transient  failures,  stable  storage  must  consist  of  a  persistent  medium  outside  the 
host  on  which  a  process  is  running.  A  replicated  file  system  is  a  possible  implementation  in  such  systems  [35]. 

2.9  Garbage  Collection 

Checkpoints  and  event  logs  consume  storage  resources.  As  the  application  progresses  and  more  recovery 
information  is  collected,  a  subset  of  the  stored  information  may  become  useless  for  recovery.  Garbage  collection  is 
the  deletion  of  such  useless  recovery  information.  A  common  approach  to  garbage  collection  is  to  identify  the 
recovery  line  and  discard  all  information  relating  to  events  that  occurred  before  that  line.  For  example,  processes 
that  coordinate  their  checkpoints  to  form  consistent  states  will  always  restart  from  the  most  recent  checkpoint  of 
each  process,  and  so  all  previous  checkpoints  can  be  discarded.  While  it  has  received  little  attention  in  the  literature, 
garbage  collection  is  an  important  pragmatic  issue  in  rollback-recovery  protocols,  because  running  a  special 
algorithm to discard useless information incurs overhead. Furthermore, recovery protocols differ in the amount and
nature  of  the  recovery  information  they  need  to  store  on  stable  storage,  and  therefore  differ  in  the  complexity  and 
invocation  frequency  of  their  garbage  collection  algorithms. 
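
The common approach described above can be sketched in a few lines, assuming that the recovery line has already been identified and that it can only advance. The function collect_garbage and the sample data are hypothetical; a real protocol would also reclaim the event logs that precede the line.

```python
def collect_garbage(checkpoints, line):
    """Split stored checkpoints into those to keep and those to discard.

    checkpoints[p]: sorted list of checkpoint indices held on stable storage.
    line[p]: index of the checkpoint of process p that lies on the recovery line.
    """
    kept, discarded = {}, {}
    for p, indices in checkpoints.items():
        kept[p] = [c for c in indices if c >= line[p]]
        discarded[p] = [c for c in indices if c < line[p]]
    return kept, discarded


stored = {"P0": [0, 1, 2, 3], "P1": [0, 1, 2], "P2": [0, 1, 2, 3, 4]}
line = {"P0": 2, "P1": 2, "P2": 3}
kept, discarded = collect_garbage(stored, line)
print(kept)       # {'P0': [2, 3], 'P1': [2], 'P2': [3, 4]}
print(discarded)  # {'P0': [0, 1], 'P1': [0, 1], 'P2': [0, 1, 2]}
```

With coordinated checkpointing the line always consists of the most recent checkpoint of each process, so the kept list would contain a single entry per process.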
3 Checkpoint-Based Rollback Recovery

Upon  a  failure,  checkpoint-based  rollback  recovery  restores  the  system  state  to  the  most  recent  consistent  set  of 
checkpoints,  i.e.  the  recovery  line  [47].  It  does  not  rely  on  the  PWD  assumption,  and  so  does  not  need  to  detect,  log, 
or  replay  nondeterministic  events.  Checkpoint-based  protocols  are  therefore  less  restrictive  and  simpler  to 
implement than log-based rollback recovery. But checkpoint-based rollback recovery does not guarantee that
pre-failure execution can be deterministically regenerated after a rollback. Therefore, checkpoint-based rollback
recovery is ill-suited for applications that require frequent interactions with the outside world, since such interactions
require  that  the  observable  behavior  of  the  system  during  recovery  be  the  same  as  during  failure-free  operation. 

Checkpoint-based  rollback-recovery  techniques  can  be  classified  into  three  categories:  uncoordinated 
checkpointing,  coordinated  checkpointing,  and  communication-induced  checkpointing.  We  examine  each  category 
in  detail. 

3.1 Uncoordinated Checkpointing

3.1.1 Overview

Uncoordinated  checkpointing  allows  each  process  maximum  autonomy  in  deciding  when  to  take  checkpoints.  The 
main  advantage  of  this  autonomy  is  that  each  process  may  take  a  checkpoint  when  it  is  most  convenient.  For 
example,  a  process  may  reduce  the  overhead  by  taking  checkpoints  when  the  amount  of  state  information  to  be 
saved  is  small  [59].  But  there  are  several  disadvantages.  First,  there  is  the  possibility  of  the  domino  effect,  which 
may  cause  the  loss  of  a  large  amount  of  useful  work,  possibly  all  the  way  back  to  the  beginning  of  the  computation. 
Second,  a  process  may  take  a  useless  checkpoint  that  will  never  be  part  of  a  global  consistent  state.  Useless 
checkpoints  are  undesirable  because  they  incur  overhead  and  do  not  contribute  to  advancing  the  recovery  line. 
Third,  uncoordinated  checkpointing  forces  each  process  to  maintain  multiple  checkpoints,  and  to  invoke  periodically 
a  garbage  collection  algorithm  to  reclaim  the  checkpoints  that  are  no  longer  useful.  Fourth,  it  is  not  suitable  for 
applications  with  frequent  output  commits  because  these  require  global  coordination  to  compute  the  recovery  line, 
negating  much  of  the  advantage  of  autonomy. 

In  order  to  determine  a  consistent  global  checkpoint  during  recovery,  the  processes  record  the  dependencies 
among their checkpoints during failure-free operation using the following technique [9]. Let $C_{i,x}$ be the $x$-th
checkpoint of process $P_i$. We call $x$ the checkpoint index. Let $I_{i,x}$ denote the checkpoint interval, or simply interval,
between checkpoints $C_{i,x-1}$ and $C_{i,x}$. As illustrated in Figure 7, if process $P_i$ at interval $I_{i,x}$ sends a message $m$ to $P_j$, it
will piggyback the pair $(i, x)$ on $m$. When $P_j$ receives $m$ during interval $I_{j,y}$, it records the dependency from $I_{i,x}$ to $I_{j,y}$,
which is later saved onto stable storage when $P_j$ takes checkpoint $C_{j,y}$.
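
The dependency-tracking technique above can be sketched as follows. This is a minimal illustration, not the implementation of [9]: the class name `Process`, the dictionary message format, and the use of an in-memory `stable` dictionary in place of stable storage are all assumptions made for the example. It also assumes that an initial checkpoint with index 0 exists, so the first interval has index 1.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    index: int = 1  # index x of the current interval I_{pid,x}
    pending: set = field(default_factory=set)   # dependencies seen in this interval
    stable: dict = field(default_factory=dict)  # checkpoint index -> saved dependencies

    def send(self, payload):
        # Piggyback the pair (i, x) that identifies the sender's current interval.
        return {"payload": payload, "dep": (self.pid, self.index)}

    def receive(self, message):
        # Record the dependency from the sender's interval to the current interval.
        self.pending.add(message["dep"])
        return message["payload"]

    def checkpoint(self):
        # Taking checkpoint C_{pid,index} saves the dependencies of the interval
        # that it closes (hypothetical stand-in for a stable-storage write).
        self.stable[self.index] = frozenset(self.pending)
        self.pending = set()
        self.index += 1
```

During recovery, the union of every process's `stable` entries and `pending` set is the information the initiator would collect to build the dependency graph.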

If  a  failure  occurs,  the  recovering  process  initiates  rollback  by  broadcasting  a  dependency  request  message  to 
collect  all  the  dependency  information  maintained  by  each  process.  When  a  process  receives  this  message,  it  stops 
its  execution  and  replies  with  the  dependency  information  saved  on  stable  storage  as  well  as  with  the  dependency 
information,  if  any,  which  is  associated  with  its  current  state.  The  initiator  then  calculates  the  recovery  line  based  on 
the  global  dependency  information  and  broadcasts  a  rollback  request  message  containing  the  recovery  line.  Upon 
receiving  this  message,  a  process  whose  current  state  belongs  to  the  recovery  line  simply  resumes  execution, 
otherwise  it  rolls  back  to  an  earlier  checkpoint  as  indicated  by  the  recovery  line. 

Figure 7. Checkpoint index and checkpoint interval.

3.1.2 Dependency Graphs and Recovery Line Calculation

There  are  two  approaches  proposed  in  the  literature  to  determine  the  recovery  line  in  checkpoint-based  recovery. 
The first approach is based on a rollback-dependency graph [9] in which each node represents a checkpoint and a
directed edge is drawn from $C_{i,x}$ to $C_{j,y}$ if either:

(1) $i \neq j$, and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$, or

(2) $i = j$ and $y = x + 1$.

The name “rollback-dependency graph” comes from the observation that if there is an edge from $C_{i,x}$ to $C_{j,y}$
and a failure forces $I_{i,x}$ to be rolled back, then $I_{j,y}$ must also be rolled back.

Figure 8(b) shows the rollback-dependency graph for the execution in Figure 8(a). The algorithm used to
compute the recovery line first marks the graph nodes corresponding to the states of processes $P_0$ and $P_1$ at the
failure point (shown in the figure as dark ellipses). It then uses reachability analysis [9] to mark all nodes
reachable from any of the initially marked nodes. The union of the last unmarked nodes over the entire system forms the
recovery line, as shown in Figure 8(b).
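
A minimal sketch of this marking procedure on a rollback-dependency graph follows. The function name `recovery_line` and the graph encoding are assumptions made for illustration: each node is a pair `(process, index)`, the node with the highest index of a process stands for its state at the failure point, and the initial checkpoint (index 0) is assumed to have no incoming edges.

```python
def recovery_line(edges, last_index, failed):
    """Compute the recovery line on a rollback-dependency graph.

    edges: dict mapping a node (process, index) to a list of successor nodes.
    last_index: dict mapping each process to the index of the node that
        represents its state at the failure point.
    failed: iterable of failed processes.
    """
    marked = set()
    # Start from the failure-point states of the failed processes.
    stack = [(p, last_index[p]) for p in failed]
    while stack:
        node = stack.pop()
        if node not in marked:
            marked.add(node)
            stack.extend(edges.get(node, ()))
    # The last unmarked node of every process forms the recovery line.
    return {
        p: max(x for x in range(last + 1) if (p, x) not in marked)
        for p, last in last_index.items()
    }
```

A surviving process whose failure-point node stays unmarked simply keeps its current state; every other process restarts from the checkpoint index returned for it.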

The second approach is based on the checkpoint graph [59]. Checkpoint graphs are similar to rollback-dependency
graphs except that, when a message is sent from $I_{i,x}$ and received in $I_{j,y}$, a directed edge is drawn from
$C_{i,x-1}$ to $C_{j,y}$ (instead of from $C_{i,x}$ to $C_{j,y}$), as shown in Figure 8(c). The recovery line can be calculated by first removing both
the nodes corresponding to the states of the failed processes at the point of failure and the edges incident on them,
and then applying the rollback propagation algorithm [59] on the checkpoint graph as shown in Figure 9. Both the
rollback-dependency graph and the checkpoint graph approaches are equivalent, in that they always produce the
same recovery line (as indeed they do in the example).
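
The rollback propagation algorithm of Figure 9 can be sketched in a few lines. This is an illustrative reading of the figure, not the code of [59]: the names `reachable` and `rollback_propagation` are hypothetical, nodes are `(process, index)` pairs, the caller is assumed to have already removed the failure-point nodes of failed processes, and the initial checkpoint (index 0) is assumed to have no incoming edges.

```python
def reachable(edges, roots):
    """Nodes reachable from any root by following at least one edge."""
    seen = set()
    stack = [succ for root in roots for succ in edges.get(root, ())]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def rollback_propagation(edges, root_set):
    """root_set: dict process -> index of its initial root (the last checkpoint
    of a failed process, or the current state of a surviving process)."""
    root_set = dict(root_set)
    marked = reachable(edges, root_set.items())
    while any(node in marked for node in root_set.items()):
        for p, x in list(root_set.items()):
            while (p, x) in marked:
                x -= 1  # step back to the last unmarked checkpoint of p
            root_set[p] = x
        marked |= reachable(edges, root_set.items())
    return root_set  # the recovery line
```

Stepping the index back one at a time is enough here because marks on a process are closed under the same-process edges: once a checkpoint is marked, every later checkpoint of that process is marked too.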


Figure 8. (a) Example execution; (b) rollback-dependency graph; (c) checkpoint graph.

3.1.3 Garbage Collection

Any checkpoint that precedes the recovery lines for all possible combinations of process failures can be
garbage-collected. The garbage collection algorithm based on a rollback-dependency graph works by identifying the
obsolete checkpoints as follows. First, it marks all volatile checkpoints and removes all edges ending in a marked
checkpoint, producing a non-volatile rollback-dependency graph [63]. Then, it uses reachability analysis to
determine the worst-case recovery line for this graph, called the global recovery line. Figure 10 shows the
non-volatile rollback-dependency graph and the global recovery line of Figure 8(a). In this case, only the first
checkpoint of each process is obsolete and can be garbage-collected. As the figure illustrates, when the global
recovery line is unable to advance because of rollback propagation, a large number of non-obsolete checkpoints may
need to be retained.

By  deriving  the  necessary  and  sufficient  conditions  for  a  checkpoint  to  be  useful  for  any  future  recovery,  it  is 
possible  to  derive  an  optimal  garbage  collection  algorithm  that  reduces  the  number  of  retained  checkpoints  [62]. 
The  algorithm  determines  a  set  of  N  recovery  lines,  the  union  of  which  contains  all  useful  checkpoints.  Each  of  the 
N  recovery  lines  is  obtained  by  initially  marking  one  volatile  checkpoint  in  the  non-volatile  rollback-dependency 
graph. For the graph in Figure 10, the optimal algorithm identifies the four checkpoints A, B, C, and D, as well as the
four obsolete checkpoints, as garbage checkpoints. The number of useful checkpoints has a tight upper bound
of $N(N+1)/2$ [62].

3.2  Coordinated  Checkpointing 
3.2.1  Overview 

Coordinated  checkpointing  requires  processes  to  orchestrate  their  checkpoints  in  order  to  form  a  consistent  global 
state.  Coordinated  checkpointing  simplifies  recovery  and  is  not  susceptible  to  the  domino  effect,  since  every 
process  always  restarts  from  its  most  recent  checkpoint.  Also,  coordinated  checkpointing  requires  each  process  to 
maintain  only  one  permanent  checkpoint  on  stable  storage,  reducing  storage  overhead  and  eliminating  the  need  for 
garbage  collection.  Its  main  disadvantage,  however,  is  the  large  latency  involved  in  committing  output,  since  a 
global  checkpoint  is  needed  before  output  can  be  committed  to  the  outside  world. 

A  straightforward  approach  to  coordinated  checkpointing  is  to  block  communications  while  the 
checkpointing  protocol  executes  [57].  A  coordinator  takes  a  checkpoint  and  broadcasts  a  request  message  to  all 
processes,  asking  them  to  take  a  checkpoint.  When  a  process  receives  this  message,  it  stops  its  execution,  flushes  all 
the  communication  channels,  takes  a  tentative  checkpoint,  and  sends  an  acknowledgment  message  back  to  the 
coordinator.  After  the  coordinator  receives  acknowledgments  from  all  processes,  it  broadcasts  a  commit  message 
that  completes  the  two-phase  checkpointing  protocol.  After  receiving  the  commit  message,  each  process  removes 
the  old  permanent  checkpoint  and  atomically  makes  the  tentative  checkpoint  permanent.  The  process  is  then  free  to 
resume  execution  and  exchange  messages  with  other  processes.  This  straightforward  approach,  however,  can  result 
in  large  overhead,  and  therefore  non-blocking  checkpointing  schemes  are  preferable  [20]. 
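
The blocking two-phase protocol described above can be sketched as follows. The class `Participant` and the function `coordinate` are hypothetical names; message passing is replaced by direct method calls, channel flushing is assumed rather than modeled, and the coordinator is treated as one of the participants.

```python
class Participant:
    def __init__(self, state):
        self.state = state
        self.permanent = None  # last committed checkpoint
        self.tentative = None  # checkpoint awaiting commit
        self.blocked = False

    def on_request(self):
        self.blocked = True          # stop execution (channels assumed flushed)
        self.tentative = self.state  # take a tentative checkpoint
        return "ack"

    def on_commit(self):
        # Discard the old permanent checkpoint and promote the tentative one.
        self.permanent, self.tentative = self.tentative, None
        self.blocked = False         # free to resume execution


def coordinate(participants):
    """Run one checkpointing session; returns True if it committed."""
    acks = [p.on_request() for p in participants]  # phase 1: request
    if all(ack == "ack" for ack in acks):
        for p in participants:                     # phase 2: commit
            p.on_commit()
        return True
    return False
```

Every process stays blocked from its `on_request` until its `on_commit`, which is the source of the overhead that motivates the non-blocking schemes.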


include last checkpoint of each failed process as an element in set RootSet;
include current state of each surviving process as an element in RootSet;
mark all checkpoints reachable by following at least one edge from any member of RootSet;
while (at least one member of RootSet is marked)
    replace each marked element in RootSet by the last unmarked checkpoint of the
        same process;
    mark all checkpoints reachable by following at least one edge from any member
        of RootSet
endwhile
RootSet is the recovery line.

Figure 9. The rollback propagation algorithm.

Figure 10. Garbage collection based on global recovery line and obsolete checkpoints.


3.2.2  Non-blocking  Checkpoint  Coordination 

A  fundamental  problem  in  coordinated  checkpointing  is  to  prevent  a  process  from  receiving  application  messages 
that could make the checkpoint inconsistent. Consider the example in Figure 11(a), in which message $m$ is sent by
$P_0$ after receiving a checkpoint request from the checkpoint coordinator. Now, assume that $m$ reaches $P_1$ before the
checkpoint request. This situation results in an inconsistent checkpoint since checkpoint $C_{1,x}$ shows the receipt of
message $m$ from $P_0$, while checkpoint $C_{0,x}$ does not show it being sent from $P_0$. If channels are FIFO, this problem
can  be  avoided  by  preceding  the  first  post-checkpoint  message  on  each  channel  by  a  checkpoint  request,  and  forcing 
each  process  to  take  a  checkpoint  upon  receiving  the  first  checkpoint-request  message,  as  illustrated  in  Figure  11(b). 
An  example  of  a  non-blocking  checkpoint  coordination  protocol  using  this  idea  is  the  distributed  snapshot  [16],  in 
which  markers  play  the  role  of  the  checkpoint-request  messages.  In  this  protocol,  the  initiator  takes  a  checkpoint 
and  broadcasts  a  marker  (a  checkpoint  request)  to  all  processes.  Each  process  takes  a  checkpoint  upon  receiving  the 
first  marker  and  rebroadcasts  the  marker  to  all  processes  before  sending  any  application  message.  The  protocol 
works  assuming  the  channels  are  reliable  and  FIFO.  If  the  channels  are  non-FIFO,  the  marker  can  be  piggybacked 
on every post-checkpoint message as in Figure 11(c) [33]. Alternatively, checkpoint indices can serve the same role
as markers, where a checkpoint is triggered when the receiver’s local checkpoint index is lower than the
piggybacked checkpoint index [20][52].
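
A small simulation of the marker rule over reliable FIFO channels is given below. It is a simplified sketch, not the full distributed snapshot protocol of [16]: recording of in-transit channel state is omitted, and the class `Node`, the `"MARKER"` sentinel, and the dictionary of `deque` channels are assumptions made for the example.

```python
from collections import deque


class Node:
    def __init__(self, pid, peers, channels):
        self.pid, self.peers, self.channels = pid, peers, channels
        self.delivered = []     # application messages delivered so far
        self.checkpoint = None  # messages reflected in the checkpoint

    def send(self, dst, item):
        self.channels[(self.pid, dst)].append(item)  # FIFO channel

    def take_checkpoint(self):
        if self.checkpoint is None:
            self.checkpoint = list(self.delivered)
            for peer in self.peers:
                # The marker precedes every post-checkpoint message.
                self.send(peer, "MARKER")

    def receive(self, src):
        item = self.channels[(src, self.pid)].popleft()
        if item == "MARKER":
            self.take_checkpoint()  # no effect if already checkpointed
        else:
            self.delivered.append(item)
```

Because the marker is queued ahead of any message sent after the checkpoint, a receiver always checkpoints before it can deliver such a message, which rules out the inconsistency of Figure 11(a).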

3.2.3  Synchronized  Checkpoint  Clocks 

Loosely  synchronized  clocks  can  facilitate  checkpoint  coordination  [58].  More  specifically,  loosely  synchronized 
clocks  can  trigger  the  local  checkpointing  actions  of  all  participating  processes  at  approximately  the  same  time 
without  a  checkpoint  initiator.  A  process  takes  a  checkpoint  and  waits  for  a  period  that  equals  the  sum  of  the 
maximum deviation between clocks and the maximum time to detect a failure in another process in the system. The
process can be assured that all checkpoints belonging to the same coordination session have been taken without the
need of exchanging any messages. If a failure occurs, it is detected within the specified time and the protocol is
aborted. To guarantee checkpoint consistency, either the sending of messages is blocked for the duration of the
protocol, or checkpoint indices are piggybacked to avoid blocking as explained above.

Figure 11. Non-blocking coordinated checkpointing: (a) checkpoint inconsistency; (b) with FIFO
channels; (c) non-FIFO channels (the short dashed line represents a piggybacked checkpoint request).

3.2.4  Minimal  Checkpoint  Coordination 

Coordinated  checkpointing  requires  all  processes  to  participate  in  every  checkpoint.  This  requirement  generates 
valid  concerns  about  its  scalability.  It  is  desirable  to  reduce  the  number  of  processes  involved  in  a  coordinated 
checkpointing  session.  This  can  be  done  since  only  those  processes  that  have  communicated  with  the  checkpoint 
initiator  either  directly  or  indirectly  since  the  last  checkpoint  need  to  take  new  checkpoints  [32]. 

The  following  two-phase  protocol  achieves  minimal  checkpoint  coordination  [32].  During  the  first  phase, 
the  checkpoint  initiator  identifies  all  processes  with  which  it  has  communicated  since  the  last  checkpoint  and  sends 
them  a  request.  Upon  receiving  the  request,  each  process  in  turn  identifies  all  processes  it  has  communicated  with 
since its last checkpoint and sends them a request, and so on, until no more processes can be identified. During the
second  phase,  all  processes  identified  in  the  first  phase  take  a  checkpoint.  The  result  is  a  consistent  checkpoint  that 
involves  only  the  participating  processes.  In  this  protocol,  after  a  process  takes  a  checkpoint,  it  cannot  send  any 
message  until  the  second  phase  terminates  successfully,  although  receiving  a  message  after  the  checkpoint  has  been 
taken  is  allowable. 
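
The first phase amounts to computing a transitive closure over the "has communicated with" relation. The sketch below shows that computation in a centralized form; in the protocol of [32] the same set is discovered by forwarding requests. The function name `participants` and the dictionary encoding of communication history are assumptions.

```python
from collections import deque


def participants(initiator, communicated_with):
    """Return the processes that must take a checkpoint.

    communicated_with: dict mapping a process to the set of processes it has
    exchanged messages with since its last checkpoint.
    """
    identified = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for q in communicated_with.get(p, ()):
            if q not in identified:  # q receives a request and repeats the step
                identified.add(q)
                queue.append(q)
    return identified
```

Processes outside the returned set keep their previous checkpoints, which is what makes the protocol cheaper than involving every process in every session.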

3.3  Communication-induced  Checkpointing 

3.3.1  Overview 

Communication-induced  checkpointing  avoids  the  domino  effect  while  allowing  processes  to  take  some  of  their 
checkpoints  independently  [14].  However,  process  independence  is  constrained  to  guarantee  the  eventual  progress 
of  the  recovery  line,  and  therefore  processes  may  be  forced  to  take  additional  checkpoints.  The  checkpoints  that  a 
process  takes  independently  are  called  local  checkpoints,  while  those  that  a  process  is  forced  to  take  are  called 
forced  checkpoints.  Communication-induced  checkpointing  piggybacks  protocol-related  information  on  each 
application  message.  The  receiver  of  each  application  message  uses  the  piggybacked  information  to  determine  if  it 
has  to  take  a  forced  checkpoint  to  advance  the  global  recovery  line.  The  forced  checkpoint  must  be  taken  before  the 
application  may  process  the  contents  of  the  message,  possibly  incurring  high  latency  and  overhead.  It  is  therefore 
desirable  in  these  systems  to  reduce  the  number  of  forced  checkpoints  to  the  extent  possible.  In  contrast  with 
coordinated  checkpointing,  no  special  coordination  messages  are  exchanged. 

We  distinguish  two  types  of  communication-induced  checkpointing.  In  model-based  checkpointing,  the 
system  maintains  checkpoint  and  communication  structures  that  prevent  the  domino  effect  or  achieve  some  even 
stronger  properties  [60].  In  index-based  coordination,  the  system  uses  an  indexing  scheme  for  the  local  and  forced 
checkpoints,  such  that  the  checkpoints  of  the  same  index  at  all  processes  form  a  consistent  state.  Recently,  it  has 
been  proved  that  the  two  types  are  fundamentally  equivalent  [25],  although  in  practice,  there  may  be  some  evidence 
that  index-based  coordination  results  in  fewer  forced  checkpoints  [2].  Other  practical  issues  concerning  these 
protocols  will  be  discussed  in  Section  5. 

3.3.2  Model-based  Checkpointing 

Model-based  checkpointing  relies  on  preventing  patterns  of  communications  and  checkpoints  that  could  result  in 
inconsistent  states  among  the  existing  checkpoints.  A  model  is  set  up  to  detect  the  possibility  that  such  patterns 
could  be  forming  within  the  system,  according  to  some  heuristic.  A  checkpoint  is  usually  forced  to  prevent  the 
undesirable  patterns  from  occurring.  Model-based  checkpointing  works  with  the  restriction  that  no  control  (out-of- 
band)  messages  are  exchanged  among  the  processes  during  normal  operation.  All  information  necessary  to  execute 
the  protocol  is  piggybacked  on  top  of  application  messages.  The  decision  to  force  a  checkpoint  is  done  locally  using 
the  information  available.  Therefore,  under  this  style  of  checkpointing  it  is  possible  that  two  processes  detect  the 
potential  for  inconsistent  checkpoints  and  independently  force  local  checkpoints  to  prevent  the  formation  of 
undesirable  patterns  that  may  never  actually  materialize  or  that  could  be  prevented  by  a  single  forced  checkpoint. 
Thus,  this  style  of  checkpointing  always  errs  on  the  conservative  side  by  taking  more  forced  checkpoints  than  is 
probably  necessary,  because  without  explicit  coordination,  no  process  has  complete  information  about  the  global 
system state.

The literature contains several domino-effect-free checkpoint and communication models. The MRS model
[50]  avoids  the  domino  effect  by  ensuring  that  within  every  checkpoint  interval  all  message-receiving  events 
precede  all  message-sending  events.  This  model  can  be  maintained  by  taking  an  additional  checkpoint  before  every 
message-receiving  event  that  is  not  separated  from  its  previous  message-sending  event  by  a  checkpoint  [60]. 
Another  way  to  prevent  the  domino  effect  is  to  avoid  rollback  propagation  completely  by  taking  a  checkpoint 
immediately  after  every  message-sending  event  [7].  Recent  work  has  focused  on  ensuring  that  every  checkpoint  can 
belong  to  a  consistent  global  checkpoint  and  therefore  is  not  useless  [5][24][25][41]. 
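
The rule that maintains the MRS model can be sketched with a single flag per process. The class name `MRSProcess` and the `forced` counter are illustrative assumptions; actual checkpoint contents and messages are omitted.

```python
class MRSProcess:
    def __init__(self):
        self.sent_since_checkpoint = False
        self.forced = 0  # number of forced checkpoints taken so far

    def checkpoint(self):
        # Any checkpoint, local or forced, starts a new interval.
        self.sent_since_checkpoint = False

    def send(self):
        self.sent_since_checkpoint = True

    def receive(self):
        # A receive that follows a send in the same interval would break the
        # "all receives precede all sends" pattern, so checkpoint first.
        if self.sent_since_checkpoint:
            self.forced += 1
            self.checkpoint()
```

The decision uses only local information, which matches the restriction that no control messages are exchanged during normal operation.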

3.3.3  Index-based  Communication-Induced  Checkpointing 

Index-based  communication-induced  checkpointing  works  by  assigning  monotonically  increasing  indexes  to 
checkpoints,  such  that  the  checkpoints  having  the  same  index  at  different  processes  form  a  consistent  state  [14].  The 
indexes  are  piggybacked  on  application  messages  to  help  receivers  decide  when  they  should  force  a  checkpoint.  For 
instance, the protocol by Briatico et al. forces a process to take a checkpoint upon receiving a message with a
piggybacked  index  greater  than  the  local  index  [14].  More  sophisticated  protocols  piggyback  more  information  on 
top  of  application  messages  to  minimize  the  number  of  forced  checkpoints  [24]. 
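
The basic index comparison can be sketched as follows. This is a simplified illustration of the rule stated above, not the full protocol of [14]: the class `IndexedProcess`, the tuple message format, and the `log` list that records checkpoint kinds are assumptions.

```python
class IndexedProcess:
    def __init__(self):
        self.index = 0  # index of the most recent checkpoint
        self.log = []   # (kind, index) of each checkpoint taken

    def local_checkpoint(self):
        self.index += 1
        self.log.append(("local", self.index))

    def send(self, payload):
        return (self.index, payload)  # piggyback the sender's index

    def receive(self, message):
        msg_index, payload = message
        if msg_index > self.index:
            # Forced checkpoint, taken before the message is processed.
            self.index = msg_index
            self.log.append(("forced", self.index))
        return payload
```

Messages carrying an index no greater than the receiver's own trigger nothing, so the overhead depends on how far the processes' indexes drift apart.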

4  Log-Based  Rollback  Recovery 

As  opposed  to  checkpoint-based  rollback  recovery,  log-based  rollback  recovery  makes  explicit  use  of  the  fact  that  a 
process  execution  can  be  modeled  as  a  sequence  of  deterministic  state  intervals,  each  starting  with  the  execution  of  a 
nondeterministic  event  [56].  Such  an  event  can  be  the  receipt  of  a  message  from  another  process  or  an  event  internal 
to  the  process.  Sending  a  message,  however,  is  not  a  nondeterministic  event.  For  example,  in  Figure  5,  the 
execution of process $P_0$ is a sequence of four deterministic intervals. The first one starts with the creation of the
process, while the remaining three start with the receipt of messages $m_0$, $m_3$, and $m_7$, respectively. Sending message
$m_2$ is uniquely determined by the initial state of $P_0$ and by the receipt of message $m_0$, and is therefore not a
nondeterministic  event. 

Log-based  rollback  recovery  assumes  that  all  nondeterministic  events  can  be  identified  and  their 
corresponding  determinants  can  be  logged  to  stable  storage.  During  failure-free  operation,  each  process  logs  the 
determinants  of  all  the  nondeterministic  events  that  it  observes  onto  stable  storage.  Additionally,  each  process  also 
takes  checkpoints  to  reduce  the  extent  of  rollback  during  recovery.  After  a  failure  occurs,  the  failed  processes 
recover  by  using  the  checkpoints  and  logged  determinants  to  replay  the  corresponding  nondeterministic  events 
precisely  as  they  occurred  during  the  pre-failure  execution.  Because  execution  within  each  deterministic  interval 
depends  only  on  the  sequence  of  nondeterministic  events  that  preceded  the  interval’s  beginning,  the  pre-failure 
execution  of  a  failed  process  can  be  reconstructed  during  recovery  up  to  the  first  nondeterministic  event  whose 
determinant  is  not  logged. 

Log-based rollback-recovery protocols have been traditionally called “message logging protocols.” The
association  of  nondeterministic  events  with  messages  is  rooted  in  the  earliest  systems  that  proposed  and 
implemented  this  style  of  recovery  [7][11][56].  These  systems  translated  nondeterministic  events  into  deterministic 
message  receipt  events. 

Log-based  rollback-recovery  protocols  guarantee  that  upon  recovery  of  all  failed  processes,  the  system  does 
not  contain  any  orphan  process,  i.e.,  a  process  whose  state  depends  on  a  nondeterministic  event  that  cannot  be 
reproduced  during  recovery.  The  way  in  which  a  specific  protocol  implements  this  condition  affects  the  protocol’s 
failure-free  performance  overhead,  latency  of  output  commit,  and  simplicity  of  recovery  and  garbage  collection,  as 
well  as  its  potential  for  rolling  back  correct  processes.  There  are  three  flavors  of  these  protocols: 

•  Pessimistic  log-based  rollback-recovery  protocols  guarantee  that  orphans  are  never  created  due  to  a  failure. 
These  protocols  simplify  recovery,  garbage  collection  and  output  commit,  at  the  expense  of  higher  failure-free 
performance  overhead. 

•  Optimistic  log-based  rollback-recovery  protocols  reduce  the  failure-free  performance  overhead,  but  allow 
orphans  to  be  created  due  to  failures.  The  possibility  of  having  orphans  complicates  recovery,  garbage 
collection  and  output  commit. 

•  Causal  log-based  rollback-recovery  protocols  attempt  to  combine  the  advantages  of  low  performance  overhead 
and  fast  output  commit,  but  they  may  require  complex  recovery  and  garbage  collection. 




We present log-based rollback-recovery protocols by first specifying a property that guarantees that no
orphans are created during an execution, and then by discussing how the three major classes of log-based
rollback-recovery protocols implement this consistency condition.

4.1  The  No-Orphans  Consistency  Condition 

Let $e$ be a nondeterministic event that occurs at process $p$. We define:

• $Depend(e)$, the set of processes that are affected by a nondeterministic event $e$. This set consists of $p$, and any
process whose state depends on the event $e$ according to Lamport’s happened before relation [34].

• $Log(e)$, the set of processes that have logged a copy of $e$'s determinant in their volatile memory.

• $Stable(e)$, a predicate that is true if $e$'s determinant is logged on stable storage.

Now, suppose that a set of processes $\Psi$ crashes. A process $p$ not in $\Psi$ becomes an orphan when its state
depends on the execution of a nondeterministic event $e$ whose determinant cannot be recovered
from stable storage or from the volatile memory of a surviving process. No orphans are created if the following property holds [1]:

$\forall e:\ \neg Stable(e) \Rightarrow Depend(e) \subseteq Log(e)$

We call this property the always-no-orphans condition. It stipulates that if any surviving process depends on
an event $e$, then either the event is logged on stable storage, or the process has a copy of the determinant of event $e$.
If neither condition is true, then the process is an orphan because it depends on an event $e$ that cannot be generated
during recovery since its determinant has been lost.
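
The condition translates directly into a set-inclusion check. In the sketch below, the function name `always_no_orphans` and the dictionary keys `depend`, `log`, and `stable` are assumptions chosen to mirror the three definitions above.

```python
def always_no_orphans(events):
    """Check the always-no-orphans condition over a collection of events.

    events: iterable of dicts with keys
        'depend' (set of processes), 'log' (set of processes), 'stable' (bool).
    """
    # For every event: stable, or every dependent process holds the determinant.
    return all(e["stable"] or e["depend"] <= e["log"] for e in events)
```

An event whose determinant is not stable and whose dependents include a process without a copy of it makes the check fail, because a crash of the processes holding the determinant would leave that dependent an orphan.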

4.2  Pessimistic  Logging 

4.2.1  Overview 

Pessimistic  logging  protocols  are  designed  under  the  assumption  that  a  failure  can  occur  after  any  nondeterministic 
event  in  the  computation.  This  assumption  is  “pessimistic”  since  in  reality  failures  are  rare.  In  their  most 
straightforward  form,  pessimistic  protocols  log  to  stable  storage  the  determinant  of  each  nondeterministic  event 
before  the  event  is  allowed  to  affect  the  computation.  These  pessimistic  protocols  implement  the  following 
property,  often  referred  to  as  synchronous  logging,  which  is  a  strengthening  of  the  always-no-orphans  condition: 

$\forall e:\ \neg Stable(e) \Rightarrow |Depend(e)| = 0$

This property stipulates that if an event has not been logged on stable storage, then no process can depend on it.

In  addition  to  logging  determinants,  processes  also  take  periodic  checkpoints  to  limit  the  amount  of  work  that 
has  to  be  repeated  in  execution  replay  during  recovery.  Should  a  failure  occur,  the  application  program  is  restarted 
from  the  most  recent  checkpoint  and  the  logged  determinants  are  used  during  recovery  to  recreate  the  pre-failure 
execution. 
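
A minimal sketch of synchronous logging and replay follows. It is a toy model under stated assumptions: a Python list stands in for stable storage, the initial state plays the role of the most recent checkpoint, messages are integers, and the update in `apply` is an arbitrary deterministic function chosen for the example. The class name `PessimisticProcess` is hypothetical.

```python
class PessimisticProcess:
    def __init__(self, stable_log):
        self.stable_log = stable_log  # stands in for stable storage
        self.state = 0                # state at the (implicit) checkpoint

    def deliver(self, message):
        # Log the determinant before the event may affect the computation.
        self.stable_log.append(message)
        self.apply(message)

    def apply(self, message):
        # Any deterministic state update works for the illustration.
        self.state = self.state * 31 + message

    @classmethod
    def recover(cls, stable_log):
        # Restart from the checkpoint and replay events in the logged order.
        proc = cls(stable_log)
        for message in list(stable_log):
            proc.apply(message)
        return proc
```

Because every delivered message is already in the log, the recovered process reaches exactly the pre-failure state, at the cost of one stable-storage write per delivery.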

Consider the example in Figure 12. During failure-free operation the logs of processes $P_0$, $P_1$ and $P_2$ contain
the determinants needed to replay messages $\{m_0, m_4, m_7\}$, $\{m_1, m_3, m_6\}$ and $\{m_2, m_5\}$, respectively. Suppose
processes $P_1$ and $P_2$ fail as shown, restart from checkpoints B and C, and roll forward using their determinant logs to
deliver again the same sequence of messages as in the pre-failure execution. This guarantees that $P_1$ and $P_2$ will
repeat exactly their pre-failure execution and re-send the same messages. Hence, once recovery is complete, both
processes will be consistent with the state of $P_0$ that includes the receipt of message $m_7$ from $P_1$.

In  a  pessimistic  logging  system,  the  observable  state  of  each  process  is  always  recoverable.  This  property 
has  four  advantages: 

1.  Processes  can  commit  output  to  the  outside  world  without  running  a  special  protocol. 

2.  Processes  restart  from  their  most  recent  checkpoint  upon  a  failure,  therefore  limiting  the  extent  of  execution  that 
has  to  be  replayed.  Thus,  the  frequency  of  checkpoints  can  be  determined  by  trading  off  the  desired  runtime 
performance  with  the  desired  protection  of  the  on-going  execution. 

3. Recovery is simplified because the effects of a failure are confined only to the processes that fail. Functioning
processes continue to operate and never become orphans because a process always recovers to the state that
included its most recent interaction with any other process or with the outside world. This is highly desirable in
practical systems [27].

4. Recovery information can be garbage-collected easily. Older checkpoints and determinants of nondeterministic
events that occurred before the most recent checkpoint can be reclaimed because they will never be needed for
recovery.

Figure 12. Pessimistic logging.

The  price  to  be  paid  for  these  advantages  is  a  performance  penalty  incurred  by  synchronous  logging. 
Implementations  of  pessimistic  logging  must  therefore  resort  to  special  techniques  to  reduce  the  effects  of 
synchronous  logging  on  performance.  Some  protocols  rely  on  special  hardware  to  facilitate  logging  [11],  while 
others  may  limit  the  number  of  tolerated  failures  to  improve  performance  [29]  [31]. 

4.2.2  Techniques  for  Reducing  Performance  Overhead 

Synchronous  logging  [11]  can  potentially  result  in  a  high  performance  overhead.  This  overhead  can  be  lowered 
using  special  hardware.  For  example,  fast  non-volatile  semiconductor  memory  can  be  used  to  implement  stable 
storage  [6].  Synchronous  logging  in  such  an  implementation  is  orders  of  magnitude  cheaper  than  with  a  traditional 
implementation  of  stable  storage  that  uses  magnetic  disk  devices.  Another  form  of  hardware  support  uses  a  special 
bus  to  guarantee  atomic  logging  of  all  messages  exchanged  in  the  system  [11].  Such  hardware  support  ensures  that 
the  log  of  one  machine  is  automatically  stored  on  a  designated  backup  without  blocking  the  execution  of  the 
application  program.  This  scheme,  however,  requires  that  all  nondeterministic  events  be  converted  into  external 
messages  [7][11]. 

Some  pessimistic  logging  systems  reduce  the  overhead  of  synchronous  logging  without  relying  on  hardware. 
For  example,  the  Sender-Based  Message  Logging  (SBML)  protocol  keeps  the  determinants  corresponding  to  the 
delivery of each message $m$ in the volatile memory of its sender [29]. The determinant of $m$, which consists of its
content and the order in which it was delivered, is logged in two steps. First, before sending $m$, the sender logs its
content  in  volatile  memory.  Then,  when  the  receiver  of  m  responds  with  an  acknowledgment  that  includes  the  order 
in  which  the  message  was  delivered,  the  sender  adds  to  the  determinant  the  ordering  information.  SBML  avoids  the 
overhead  of  accessing  stable  storage  but  tolerates  only  one  failure  and  cannot  handle  nondeterministic  events 
internal  to  a  process.  Extensions  to  this  technique  can  tolerate  more  than  one  failure  in  special  network 
topologies  [31]. 
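
The two logging steps of SBML can be sketched as follows. The classes `Sender` and `Receiver`, the message identifiers, and the tuple formats are assumptions made for the example; failure handling and recovery are not modeled.

```python
class Sender:
    def __init__(self):
        self.volatile_log = {}  # message id -> [content, delivery order]

    def send(self, msg_id, content):
        # Step 1: log the content in volatile memory before sending.
        self.volatile_log[msg_id] = [content, None]
        return (msg_id, content)

    def on_ack(self, msg_id, order):
        # Step 2: complete the determinant with the ordering information.
        self.volatile_log[msg_id][1] = order


class Receiver:
    def __init__(self):
        self.next_order = 0

    def deliver(self, message):
        msg_id, _content = message
        order = self.next_order
        self.next_order += 1
        return (msg_id, order)  # acknowledgment carrying the delivery order
```

Until the acknowledgment arrives, the sender holds only a partial determinant (content without order).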

4.2.3  Relaxing  Logging  Atomicity 

The  performance  overhead  of  pessimistic  logging  can  be  reduced  by  delivering  a  message  or  an  event  and  deferring 
its  logging  until  the  host  communicates  with  another  host  or  with  the  outside  world  [28] [29].  In  the  example  of 
Figure  12,  process  Pq  may  defer  the  logging  of  messages  and  mq  until  it  communicates  with  another  process  or 
the  outside  world.  Process  Po  implements  the  following  weaker  property,  which  still  guarantees  the  always-no- 
orphans  condition: 

Ve:  -iStable(e)  =>  /Depend(e)  j  <1 

This property relaxes the condition of pessimistic logging by allowing a single process to be affected by an
event that has yet to be logged, provided that the process does not externalize the effect of this dependency to other
processes or the outside world. Thus, messages m4 and m7 are allowed to affect process P0, but this effect is local:
no other process or the outside world can see it until the messages are logged.
The observed behavior of each process is the same as with an implementation that logs events before
delivering them to applications. Event logging and delivery are not performed in one atomic operation in this
variation of pessimistic logging. This scheme reduces overhead because several events can be logged in one
operation, reducing the frequency of synchronous access to stable storage. Latency of interprocess communication
and output commit are not reduced since a logging operation may often be needed before sending a message.
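
The deferral described above can be sketched as a logger that delivers events immediately but flushes all pending determinants in one batch before anything is sent. This is an illustrative sketch only: the class name, the list standing in for stable storage, and the flush counter are assumptions made for the example.

```python
class DeferredLogger:
    """Delivers events at once; logs them in one batch before any send."""

    def __init__(self):
        self.stable_log = []   # stands in for stable storage
        self.pending = []      # delivered but not yet logged
        self.flushes = 0       # number of synchronous stable-storage writes

    def deliver(self, event):
        # The unlogged event affects only this process.
        self.pending.append(event)

    def flush(self):
        if self.pending:
            self.stable_log.extend(self.pending)  # one batched write
            self.pending.clear()
            self.flushes += 1

    def send(self, message, network):
        # Log everything before the dependency becomes visible to others.
        self.flush()
        network.append(message)
```

Two deliveries followed by one send cost a single synchronous write instead of two, but the send itself still waits for that write, which matches the observation that communication latency is not reduced.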

Systems that separate logging of an event from its delivery may lose the last messages delivered before a
failure. This may be a problem for applications that assume that processes communicate through reliable channels.
Consider one of these applications going through the execution shown in Figure 12, and assume that process P0 fails
after delivering messages m4 and m7 but before the corresponding determinants (containing the content and order of
receipt of the messages) are logged. Protocols in which the receiver logs the message content cannot guarantee that
the recovered P0 will ever deliver m4 and m7, violating the assumption about reliable channels. This problem does
not arise in protocols that log messages at the sender or do not assume reliable communication channels
[18][28][29][30].

4.3  Optimistic  Logging 

4.3.1  Overview 

In  optimistic  logging  protocols,  processes  log  determinants  asynchronously  to  stable  storage  [56].  These  protocols 
make  the  optimistic  assumption  that  logging  will  complete  before  a  failure  occurs.  Determinants  are  kept  in  a 
volatile  log,  which  is  periodically  flushed  to  stable  storage.  Thus,  optimistic  logging  does  not  require  the  application 
to  block  waiting  for  the  determinants  to  be  actually  written  to  stable  storage,  and  therefore  incurs  little  overhead 
during  failure-free  execution.  However,  this  advantage  comes  at  the  expense  of  more  complicated  recovery, 
garbage  collection,  and  slower  output  commit  than  in  pessimistic  logging.  If  a  process  fails,  the  determinants  in  its 
volatile  log  will  be  lost,  and  the  state  intervals  that  were  started  by  the  nondeterministic  events  corresponding  to 
these  determinants  cannot  be  recovered.  Furthermore,  if  the  failed  process  sent  a  message  during  any  of  the  state 
intervals  that  cannot  be  recovered,  the  receiver  of  the  message  becomes  an  orphan  process  and  must  roll  back  to 
undo  the  effects  of  receiving  the  message.  Optimistic  protocols  do  not  implement  the  always-no-orphans  condition, 
and  therefore  permit  the  temporary  creation  of  orphan  processes.  However,  they  require  that  the  property  holds  by 
the  time  recovery  is  complete.  This  is  achieved  during  recovery  by  rolling  back  orphan  processes  until  their  states 
do not depend on any message whose determinant has been lost. For example, suppose process P2 in Figure 13 fails
before the determinant for m5 is logged to stable storage. Process P1 then becomes an orphan process and must roll
back to undo the effects of receiving the orphan message m6. The rollback of P1 further forces P0 to roll back to
undo the effects of receiving message m7.

To perform these rollbacks correctly, optimistic logging protocols track causal dependencies during
failure-free execution. Upon a failure, the dependency information is used to calculate and recover the latest global
state of the pre-failure execution in which no process is an orphan.
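
One way to picture the orphan calculation is as a fixed-point computation over recorded dependencies. The sketch below is an illustration under stated assumptions, not an algorithm from the cited protocols: state intervals are hypothetical (process, index) tuples, and `depends_on` is assumed to hold every direct dependency, including the dependency of an interval on its predecessor in the same process.

```python
def find_orphans(depends_on, lost):
    """Return the state intervals that transitively depend on a lost interval.

    depends_on: dict mapping an interval to the set of intervals it
                directly depends on.
    lost:       set of intervals whose determinants were lost in the failure.
    """
    orphans = set()
    changed = True
    while changed:                      # iterate until no new orphan is found
        changed = False
        for interval, deps in depends_on.items():
            if interval in orphans or interval in lost:
                continue
            if deps & (lost | orphans):
                orphans.add(interval)
                changed = True
    return orphans
```

Every process then rolls back to its latest state that is neither lost nor in the returned set.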

The above example also illustrates why optimistic logging protocols require a nontrivial garbage collection
algorithm. While pessimistic protocols need only keep the most recent checkpoint of each process, optimistic
protocols may need to keep multiple checkpoints. In the example, the failure of P2 forces P1 to restart from
checkpoint B instead of its most recent checkpoint D.

Finally, since determinants are logged asynchronously, output commit in optimistic logging protocols
generally requires multi-host coordination to ensure that no failure scenario can revoke the output. For example, if
process P0 needs to commit output at state X, it must log messages m4 and m7 to stable storage and ask P2 to log m2
and m5.

4.3.2  Synchronous  vs.  Asynchronous  Recovery 

Recovery  in  optimistic  logging  protocols  can  be  either  synchronous  or  asynchronous.  In  synchronous  recovery 
[28][30][53],  all  processes  run  a  recovery  protocol  to  compute  the  maximum  recoverable  system  state  based  on 
dependency  and  logged  information,  and  then  perform  the  actual  rollbacks.  During  failure-free  execution,  each 
process  increments  a  state  interval  index  at  the  beginning  of  each  state  interval.  Dependency  tracking  can  be  either 
direct  or  transitive. 

Figure 13. Optimistic logging.

In direct dependency tracking [28][53], the state interval index of the sender is piggybacked on each outgoing
message to allow the receiver to record the dependency directly caused by the message. These direct dependencies
can then be assembled at recovery time to obtain complete dependency information. Alternatively, transitive
dependency tracking [53][56] can be used: each process Pi maintains a size-N vector TDi, where TDi[i] is Pi's
current state interval index, and TDi[j], j ≠ i, records the highest index of any state interval of Pj on which Pi
depends. Transitive dependency tracking generally incurs a higher failure-free overhead for piggybacking and
maintaining the dependency vectors, but allows faster output commit and recovery.
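
A transitive dependency vector can be maintained as sketched below. The sketch assumes that every message delivery starts a new state interval and that the full vector is piggybacked on each message; the class and method names are hypothetical.

```python
class Process:
    def __init__(self, pid, n):
        self.pid = pid
        self.td = [0] * n   # td[pid] is this process's own state interval index

    def send(self):
        # Piggyback a copy of the current dependency vector on the message.
        return list(self.td)

    def deliver(self, piggybacked):
        # Delivering a message starts a new state interval.
        self.td[self.pid] += 1
        # Merge the sender's dependencies component-wise (highest index wins).
        for j, idx in enumerate(piggybacked):
            if j != self.pid:
                self.td[j] = max(self.td[j], idx)
```

After a chain of messages from P2 to P1 to P0, the vector at P0 records its dependency on P2 even though the two never communicated directly, which is what makes recovery-time assembly unnecessary.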

In asynchronous recovery, a failed process restarts by sending a rollback announcement broadcast or a
recovery message broadcast to start a new incarnation [55][56]. Upon receiving a rollback announcement, a process
rolls back if it detects that it has become an orphan with respect to that announcement, and then broadcasts its own
rollback announcement. Since rollback announcements from multiple incarnations of the same process may coexist
in the system, each process in general needs to track the dependency of its state on every incarnation of all processes
to correctly detect orphaned states. A way to limit dependency tracking to only one incarnation of each process is to
force a process to delay its delivery of certain messages. That is, before a process Pi can deliver any message
carrying a dependency on an unknown incarnation of process Pj, Pi must first receive rollback announcements from
Pj to verify that Pi's current state does not depend on any invalid state of Pj's previous incarnations. Piggybacking
all rollback announcements known to a process on every outgoing message can eliminate blocking, and the amount
of piggybacked information can be further reduced to a provable minimum [55].

Another issue in asynchronous recovery protocols is the possibility of exponential rollbacks. This
phenomenon occurs if a single failure causes a process to roll back an exponential number of times [53]. Figure 14
gives an example, where each integer pair [i, x] represents the x-th state interval of the i-th incarnation of a process.
Suppose P0 fails and loses its interval [1,2]. When P0's rollback announcement r0 reaches P1, the latter rolls back to
interval [2,3] and broadcasts another rollback announcement r1. If r1 reaches P2 before r0 does, P2 will first roll back
to [4,5] in response to r1, and later roll back again to [4,4] upon receiving r0. By generalizing this example, we can
construct scenarios in which process Pi, i > 0, rolls back 2^(i-1) times in response to P0's failure.
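
One way to see where an exponential count can come from is the following recurrence. It is a sketch under a worst-case assumption that is not spelled out in the text: every rollback announcement reaching Pi arrives in an order that forces a separate rollback, and every rollback of Pj triggers a new announcement. R(i) is a symbol introduced here for the number of rollbacks of Pi.

```latex
% P_i rolls back once for P_0's announcement and once for every
% announcement broadcast by P_1, ..., P_{i-1}:
R(1) = 1, \qquad R(i) = 1 + \sum_{j=1}^{i-1} R(j) \quad (i \ge 2)
% Subtracting the expression for R(i-1) from that for R(i):
R(i) - R(i-1) = R(i-1) \;\Longrightarrow\; R(i) = 2\,R(i-1) = 2^{\,i-1}
```

This agrees with the example: P1 rolls back once and P2 rolls back twice.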

Several  approaches  have  been  proposed  to  ensure  that  any  process  will  roll  back  at  most  once  in  response  to  a 
single  failure.  By  distinguishing  failure  announcements  from  rollback  announcements  and  broadcasting  only  the 
former,  the  source  of  the  exponential-rollbacks  problem  is  eliminated  [53].  Another  possibility  is  to  piggyback  the 
original  rollback  announcement  from  the  failed  process  on  every  subsequent  rollback  announcement  that  it  triggers. 
For example, in Figure 14, process P1 piggybacks r0 on r1. Exponential rollbacks can be avoided by piggybacking all
rollback announcements on every application message [55].

4.4  Causal  Logging 

4.4.1  Overview 

Causal  logging  has  the  failure-free  performance  advantages  of  optimistic  logging  while  retaining  most  of  the 
advantages  of  pessimistic  logging  [1][18].  Like  optimistic  logging,  it  avoids  synchronous  access  to  stable  storage 
except  during  output  commit.  Like  pessimistic  logging,  it  allows  each  process  to  commit  output  independently  and 
never  creates  orphans,  thereby  isolating  processes  from  the  effects  of  failures  that  occur  in  other  processes. 
Furthermore,  causal  logging  limits  the  rollback  of  any  failed  process  to  the  most  recent  checkpoint  on  stable  storage. 
This  reduces  the  storage  overhead  and  the  amount  of  work  at  risk.  These  advantages  come  at  the  expense  of  a  more 
complex recovery protocol.

Figure 14. Exponential rollbacks.


Causal logging protocols ensure the always-no-orphans property by ensuring that the determinant of each
nondeterministic event that causally precedes the state of a process is either stable or available locally to that
process. Consider the example in Figure 15(a). While messages m5 and m6 may be lost upon the failure, process P0
at state X will have logged the determinants of the nondeterministic events that causally precede its state according
to Lamport’s happened-before relation [34]. These events consist of the delivery of messages m0, m1, m2, m3 and m4.
The determinant of each of these nondeterministic events is either logged on stable storage or is available in the
volatile log of process P0. The determinant of each of these events contains the order in which its original receiver
delivered the corresponding message. The message sender, as in sender-based message logging, logs the message
content. Thus, process P0 will be able to “guide” the recovery of P1 and P2 since it knows the order in which P1
should replay messages m1 and m3 to reach the state from which P1 sends message m4. Similarly, P0 has the order in
which P2 should replay message m2 to be consistent with both P0 and P1. The content of these messages is obtained
from the sender log of P0 or regenerated deterministically during the recovery of P1 and P2. Notice that information
about m5 and m6 is not available anywhere. These messages may be replayed after recovery in a different order, if at
all. However, since they had no effect on a surviving process or the outside world, the resulting state is consistent.
The determinant log kept by each process acts as insurance to protect it from the failures that occur in other
processes. It also allows the process to make its state recoverable by simply logging the information available
locally. Thus, a process does not need to run a multi-host protocol to commit output.

4.4.2  Tracking  Causality 

Causal logging protocols implement the always-no-orphans condition by having processes piggyback the
non-stable determinants in their volatile log on the messages they send to other processes. On receiving a message, a
process first adds any piggybacked determinant to its volatile determinant log and then delivers the message to the
application.
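
The piggybacking rule can be sketched as follows. This is a simplified illustration rather than any published protocol: determinants are reduced to hypothetical (receiver, delivery order) tuples, message identifiers are supplied by the caller, and the `stable` set is assumed to be updated by some separate logging activity that is not shown.

```python
class CausalProcess:
    def __init__(self, name):
        self.name = name
        self.det_log = {}    # volatile determinant log: event id -> determinant
        self.stable = set()  # ids of determinants known to be on stable storage
        self.delivered = 0   # number of messages delivered so far

    def send(self, payload):
        # Piggyback every determinant that is not yet stable.
        non_stable = {e: d for e, d in self.det_log.items()
                      if e not in self.stable}
        return {"payload": payload, "determinants": non_stable}

    def receive(self, msg_id, message):
        # First record the piggybacked determinants ...
        self.det_log.update(message["determinants"])
        # ... then create the determinant of this delivery and deliver.
        self.det_log[msg_id] = (self.name, self.delivered)
        self.delivered += 1
        return message["payload"]
```

After P1 delivers a message and then sends to P0, P0's volatile log holds the determinant of P1's delivery, so P0 never depends on an event whose determinant it cannot supply during recovery.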

The  Manetho  system  propagates  the  causal  information  in  an  antecedence  graph  [18].  The  antecedence 
graph  provides  every  process  in  the  system  with  a  complete  history  of  the  nondeterministic  events  that  have  causal 
effects  on  its  state.  The  graph  has  a  node  representing  each  nondeterministic  event  that  precedes  the  state  of  a 
process, and the edges correspond to the happened-before relation [34]. Figure 15(b) shows the antecedence graph
of process P0 of Figure 15(a) at state X. During failure-free operation, each process piggybacks on each application
message the determinants that contain the receipt orders of its direct and transitive antecedents, i.e., its local
antecedence graph. The receiver of the message records these receipt orders in its volatile log.

In practice, carrying the entire graph on each application message may lead to an unacceptable overhead.
Fortunately, each message carries a graph that is a superset of the one piggybacked on the previous message sent
from the same host. This fact can be used in practical implementations to reduce the amount of information carried
on application messages. Thus, a message from process p to process q needs to carry only the part of the graph that
was added since the previous message p sent to q. Furthermore, if p has recently
received a message from q, it can exclude the graph portions that have been piggybacked on that message. Process
q already has the information in these excluded portions, and therefore transmitting them serves no purpose. Other
optimizations are also possible but depend on the semantics of the communication protocol [18]. An
implementation of this technique shows that it has very low overhead in practice [18].

Further reduction of the overhead is possible if the system is willing to tolerate a number of failures that is
less than the total number of processes in the system. This observation is the basis of Family Based Logging
protocols (FBL) that are parameterized by the number of tolerated failures [1]. The basis of these protocols is that to
tolerate f process failures, it is sufficient to log each nondeterministic event in the volatile store of f + 1 different
hosts. Hence, the predicate Stable(e) holds as soon as |Log(e)| > f. Sender-based logging is used to support message
replay during recovery and determinants are piggybacked on application messages. However, unlike Manetho,
propagation of information about an event stops when it has been recorded in f + 1 processes. For f < N, FBL
protocols do not access stable storage except for checkpointing. Reducing access to stable storage in turn reduces
performance overhead and implementation complexity. Applications pay only the overhead that corresponds to the
number of failures they are willing to tolerate. An implementation for the protocol with f = 1 confirms that the
performance overhead is very small [1]. The Manetho protocol is an FBL protocol corresponding to the case
of f = N.
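
The stopping rule of FBL can be sketched with the stability predicate alone. The function names and the representation of Log(e) as a set of process names are assumptions for this example; how a process learns which hosts hold a determinant is protocol-specific and is not modeled.

```python
def is_stable(holders, f):
    """Stable(e) holds once more than f hosts hold the determinant of e."""
    return len(holders) > f


def determinants_to_piggyback(det_log, f):
    """Select the events whose determinants must still be propagated.

    det_log: dict mapping an event id to the set of process names known
             to hold that event's determinant, i.e. Log(e).
    """
    return {e for e, holders in det_log.items() if not is_stable(holders, f)}
```

With f = 1, an event logged at two hosts is no longer piggybacked, whereas with f = 2 it still is, so the piggybacking cost grows with the number of failures to be tolerated.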

4.5  Comparison 

Different  rollback-recovery  protocols  offer  different  tradeoffs  with  respect  to  properties  such  as  performance 
overhead,  latency  of  output  commit,  storage  overhead,  ease  of  garbage  collection,  simplicity  of  recovery,  freedom 
from  domino  effect,  freedom  from  orphan  processes,  and  the  extent  of  rollback.  Table  1  provides  a  brief 
comparison  between  the  different  styles  of  rollback-recovery  protocols. 

Since garbage collection and recovery both involve calculating a recovery line, they can be performed by
simple procedures under coordinated checkpointing and pessimistic logging, both of which have a predetermined
recovery line during failure-free execution. The extent of any potential rollback determines the maximum number
of checkpoints each process needs to retain. Uncoordinated checkpointing can have unbounded rollbacks, and a
process may need to retain up to N checkpoints if the optimal garbage collection algorithm is used [62]. Also,
several checkpoints may need to be kept under optimistic logging, depending on the specifics of the logging scheme.
Note that we do not include failure-free overhead as a factor in the comparison. Several studies have shown that
these protocols perform reasonably well in practice, and that several factors such as checkpointing frequency,
machine speed, and stable storage bandwidth play more important roles than the fundamental aspects of a particular
protocol [1][18][20][26][28][39][43][44][48][49][52].

Figure 15. Causal logging: (a) maximum recoverable states, and (b) antecedence graph of P0 at state X.

5  Implementation  Issues 

5.1  Overview 

While  there  is  a  rich  body  of  research  on  the  algorithmic  aspects  of  rollback-recovery  protocols,  reports  on 
experimental  prototypes  or  commercial  implementations  are  relatively  scarce.  The  few  experimental  studies 
available  have  shown  that  building  rollback-recovery  protocols  with  low  failure-free  overhead  is  feasible.  These 
studies  also  provide  ample  evidence  that  the  main  difficulty  in  implementing  these  protocols  lies  in  the  complexity 
of  handling  recovery  [18].  It  is  interesting  that  all  commercial  implementations  of  message  logging  use  pessimistic 
logging  because  it  simplifies  recovery  [11][27]. 

Several recent studies have also challenged some premises on which many rollback-recovery protocols rely.
Many of these protocols were conceived in the 1980s, when processor speed and network bandwidth were such
that communication overhead was deemed too high, especially when compared to the cost of stable storage access
[10]. On such platforms, multi-host coordination incurs a large overhead because of the necessary control messages.
A protocol that avoids such communication overhead at the expense of more stable storage access
performs better on such platforms. Recently, processor speed and network bandwidth have increased dramatically,
while the speed of stable storage access has remained relatively the same.* This change in the equation suggests a
fresh look at the premises of many rollback-recovery protocols, and recent results have shown that
[1][18][28][39][43][52][54]:

•  Stable storage access is now the major source of overhead in checkpointing or message logging systems.
Communication overhead is much lower in comparison. Such changes favor coordinated checkpointing
schemes over message logging or uncoordinated checkpointing systems, as they require less access to stable
storage and are simpler to implement.

•  The case for message logging has become the ability to interact with the outside world, instead of reducing the
overhead of multi-process coordination [21]. Message logging systems can implement efficient protocols for
committing output and logging input that are not possible in checkpoint-only systems.

•  Recent advances have shown that arbitrary forms of nondeterminism can be supported at a very low overhead in
logging systems. Nondeterminism was previously deemed one of the complexities inherent in message logging
systems.

In the remainder of this section, we address these and other issues in some detail.

*  While semiconductor-based stable storage is becoming more widely available, the size-cost ratio is too low
compared to disk-based stable storage. It appears that for some time to come, disk-based systems will continue to
be the medium of choice for storing the large files that are needed in checkpointing and logging systems.

5.2  Checkpointing  Implementation 

All  available  studies  have  shown  that  writing  the  state  of  a  process  to  stable  storage  is  the  largest  contributor  to  the 
performance  overhead  [43].  The  simplest  way  to  save  the  state  of  a  process  is  to  suspend  execution,  save  the 
process’s  address  space  on  stable  storage,  and  then  resume  execution  [57].  This  scheme  can  be  costly  for  programs 
with large address spaces if stable storage is implemented using magnetic disks, as is customary. Several
techniques exist to reduce this overhead.

5.2.1  Concurrent  Checkpointing 

All  available  studies  show  that  concurrent  checkpointing  greatly  reduces  the  overhead  of  saving  the  state  of  a 
process [23][43]. Concurrent checkpointing relies on the memory protection hardware available in modern
computer architectures to continue the execution of the process while its checkpoint is being saved on stable storage.
The  address  space  is  protected  from  further  modification  at  the  start  of  a  checkpoint  and  the  memory  pages  are  saved 
to  disk  concurrently  with  the  program  execution.  If  the  program  attempts  to  modify  a  page,  it  incurs  a  protection 
violation.  The  checkpointing  system  copies  the  page  into  a  separate  buffer  from  which  it  is  saved  on  stable  storage. 
The  original  page  is  unprotected  and  the  application  program  is  allowed  to  resume.  Implementations  that  do  not 
incorporate  concurrent  checkpointing  may  pay  a  heavy  performance  overhead  unless  the  checkpointing  interval  is 
set  to  a  large  value,  which  in  turn  would  increase  the  time  for  rollback. 

5.2.2  Incremental  Checkpointing 

Adding  incremental  checkpointing  [22]  to  concurrent  checkpointing  can  further  reduce  the  overhead  [20]. 
Incremental  checkpointing  avoids  rewriting  portions  of  the  process  states  that  do  not  change  between  consecutive 
checkpoints.  It  can  be  implemented  by  using  the  dirty-bit  of  the  memory  protection  hardware  or  by  emulating  a 
dirty-bit  in  software  [4].  A  public  domain  package  implementing  this  technique  along  with  concurrent 
checkpointing  is  available  [44]. 
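
The software-emulated dirty bit can be illustrated with a toy paged state. This sketch is not the package cited above: the class, the page representation, and the restore helper are hypothetical, and a real implementation would track writes through memory protection rather than through an explicit `write` method.

```python
class PagedState:
    def __init__(self, num_pages, page_size=4):
        self.pages = [bytes(page_size) for _ in range(num_pages)]
        # Every page counts as dirty until the first checkpoint is taken.
        self.dirty = set(range(num_pages))

    def write(self, page_no, data):
        self.pages[page_no] = data
        self.dirty.add(page_no)          # software-emulated dirty bit

    def incremental_checkpoint(self):
        """Return only the pages modified since the previous checkpoint."""
        delta = {p: self.pages[p] for p in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(checkpoints, num_pages):
    """Rebuild the state by applying incremental checkpoints, oldest first."""
    pages = [None] * num_pages
    for delta in checkpoints:
        for page_no, data in delta.items():
            pages[page_no] = data
    return pages
```

The second checkpoint in the usage below contains one page instead of three, at the price of needing the whole chain of checkpoints at restore time.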

Incremental  checkpointing  can  also  be  extended  over  several  processes.  In  this  technique,  the  system  saves 
the  computed  parity  or  some  function  of  the  memory  pages  that  are  modified  across  several  processes  [45].  This 
technique  is  very  similar  to  parity  computation  in  RAID  disk  systems.  The  parity  pages  can  be  saved  in  volatile 
memory  of  some  other  processes  thereby  avoiding  the  need  to  access  stable  storage.  The  storage  overhead  of  this 
method  is  very  low,  and  it  can  be  adjusted  depending  on  how  many  failures  the  system  is  willing  to  tolerate. 
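
The RAID analogy can be made concrete with bytewise XOR parity. This sketch assumes equally sized pages, one page per process, and a single failure; the function names are illustrative, and the cited scheme may use a different function of the pages.

```python
from functools import reduce


def xor_pages(a, b):
    """Bytewise XOR of two equally sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(pages):
    """Parity page over the corresponding pages of several processes."""
    return reduce(xor_pages, pages)


def recover_lost(surviving_pages, parity_page):
    """Rebuild the single missing page from the survivors and the parity."""
    return reduce(xor_pages, surviving_pages, parity_page)
```

Storing one parity page for several processes costs far less than storing a copy of each page, but plain XOR parity recovers only one lost page per group.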

Another technique for implementing incremental checkpointing is to directly compare the program’s state
with the previous checkpoint in software, and write the difference in a new checkpoint [46]. The required storage
and computation overhead to perform such a comparison may waste the benefit of incremental checkpointing.
Another variation on this technique is to use probabilistic checkpointing [40]. The unit of checkpointing in this
scheme is a memory block that is typically much smaller than a memory page. Changes to a memory block are
detected by computing a signature and comparing it to the corresponding signature in the previous checkpoint.
Probabilistic checkpointing is portable, and has lower storage and computation requirements than required by
comparing the checkpoints directly. On the downside, computing a signature to detect changes opens the door for
aliasing. This problem occurs when the computed signature does not differ from the corresponding one in the
previous checkpoint, even though the associated memory block has changed. In such a situation, the memory block
is excluded from the new checkpoint, which therefore becomes erroneous. A probabilistic analysis has shown that
the likelihood of aliasing in practice is negligible, but an experimental evaluation has shown that probabilistic
checkpointing is unsafe in practice [19].
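
Signature-based change detection can be sketched as follows. CRC-32 and the 8-byte block size are stand-ins chosen for this example; they are not the signature function or block size of the cited scheme.

```python
import zlib

BLOCK = 8  # bytes per block (hypothetical value for the example)


def signatures(memory):
    """Compute one signature per fixed-size block of memory."""
    return [zlib.crc32(memory[i:i + BLOCK])
            for i in range(0, len(memory), BLOCK)]


def changed_blocks(memory, prev_sigs):
    """Return indices of blocks whose signature differs, plus the new signatures."""
    sigs = signatures(memory)
    changed = [i for i, (new, old) in enumerate(zip(sigs, prev_sigs))
               if new != old]
    return changed, sigs
```

Only blocks listed in `changed` would be written to the new checkpoint. Aliasing corresponds to a modified block whose signature happens to equal the old one: it would be silently left out, because the comparison never looks at the block contents themselves.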

5.2.3  System-level  versus  User-level  Implementations 

Support for checkpointing can be implemented in the kernel [7][11][18][28], or it can be implemented by a library
linked with the user program [1][23][26][44]. Kernel-level implementations are more powerful because they can
also capture kernel data structures that support the user process. However, these implementations are necessarily
not portable.

Checkpointing can also be implemented at user level. System calls that manipulate memory protection such
as mprotect of UNIX can emulate concurrent and incremental checkpointing. The fork system call of UNIX can
implement concurrent checkpointing if the operating system implements fork using copy-on-write protection [23].
User-level implementations, however, cannot access the kernel data structures that belong to the process, such as open
file descriptors and message buffers, but these data structures can be emulated at user level [26].
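
The fork-based approach can be sketched in a few lines on a UNIX-like system. This is an illustration of the idea only: the function name is hypothetical, `pickle` stands in for saving the address space, and the sketch shows the snapshot semantics of fork rather than its performance in any particular runtime.

```python
import os
import pickle
import tempfile


def checkpoint_with_fork(state, path):
    """Save `state` to `path` from a child process; return the child's pid.

    The child sees a snapshot of the parent's memory as of the fork, so
    the parent can keep modifying `state` while the checkpoint is written.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            os._exit(status)     # never return into the parent's code path
    return pid                   # parent resumes immediately


if __name__ == "__main__":
    state = {"x": 1}
    path = os.path.join(tempfile.mkdtemp(), "ckpt")
    child = checkpoint_with_fork(state, path)
    state["x"] = 2               # does not affect the checkpoint being saved
    os.waitpid(child, 0)
```

The parent must eventually wait for the child, both to reap it and to know that the checkpoint is complete before an older one is discarded.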

5.2.4  Compiler  Support 

A compiler can be instrumented to generate code that supports checkpointing [36]. A compiled program would
contain  code  that  decides  when  and  what  to  checkpoint.  The  advantage  of  this  technique  is  that  the  compiler  can 
decide  on  the  variables  that  must  be  saved,  therefore  avoiding  unnecessary  data.  For  example,  dead  variables  within 
a  program  are  not  saved  in  a  checkpoint  though  they  have  been  modified.  Furthermore,  the  compiler  may  decide  the 
points  during  program  execution  where  the  amount  of  state  to  be  saved  is  small. 

Despite  these  promising  advantages,  there  are  difficulties  with  this  approach.  It  is  generally  undecidable  to  find  the 
point  in  program  execution  most  suitable  to  take  a  checkpoint.  There  are,  however,  several  heuristics  that  can  be 
used.  The  programmer  can  provide  hints  to  the  compiler  about  where  checkpoints  should  be  inserted  or  what  data 
variables  should  be  stored  [8]  [44].  The  compiler  may  also  be  trained  by  running  the  application  in  an  iterative 
manner  and  observing  its  behavior  [36].  The  observed  behavior  could  help  decide  the  execution  points  where  it 
would  be  appropriate  to  insert  checkpoints.  Compiler  support  could  also  be  simplified  in  languages  that  support 
automatic  garbage  collection  [3].  The  execution  point  after  each  major  garbage  collection  provides  a  convenient 
place to take a checkpoint at a minimum cost.

5.2.5  Checkpoint  Placement 

A  large  amount  of  work  has  been  devoted  to  analyzing  and  deriving  the  optimal  checkpointing  frequency  and 
placement  [15].  The  problem  is  usually  formulated  as  an  optimization  problem  subject  to  constraints.  For  instance, 
a  system  may  attempt  to  reduce  the  number  of  checkpoints  taken  subject  to  a  certain  limit  on  the  amount  of  expected 
rollback.  Generally,  it  has  been  observed  in  practice  that  the  overhead  of  checkpointing  is  usually  negligible  unless 
the checkpointing interval is relatively small, and therefore the optimality of checkpoint placement is rarely an issue
in  practical  systems  [20]. 
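
As an illustration of how such an optimization is typically set up, the following is the classic first-order model usually attributed to Young. It is not taken from this survey, and the symbols are introduced here: C is the cost of one checkpoint, M the mean time between failures, and T the checkpointing interval, with C much smaller than T and T much smaller than M.

```latex
% Expected fraction of time lost: checkpoint cost per interval plus,
% on average, half an interval of recomputation per failure.
O(T) \approx \frac{C}{T} + \frac{T}{2M}
% Setting the derivative to zero gives the optimal interval:
\frac{dO}{dT} = -\frac{C}{T^{2}} + \frac{1}{2M} = 0
\;\Longrightarrow\; T_{\mathrm{opt}} = \sqrt{2CM}
% Example: C = 1 minute, M = 24 hours = 1440 minutes
T_{\mathrm{opt}} = \sqrt{2 \cdot 1 \cdot 1440} \approx 53.7 \text{ minutes}
```

The curve O(T) is flat around its minimum, which is consistent with the observation that exact placement rarely matters in practice.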

5.3  Checkpointing  Protocols  in  Comparison 

Many checkpointing protocols were conceived at a time when the communication overhead far exceeded the
overhead of accessing stable storage. Furthermore, the memory available to run processes tended to be small.
These tradeoffs naturally favored uncoordinated checkpointing schemes over coordinated checkpointing schemes.
Current technological trends, however, have reversed this tradeoff.

In modern systems, the overhead of coordinating checkpoints is negligible compared to the overhead of
saving  the  states  [1][18][28][39][43][52].  Using  concurrent  and  incremental  checkpointing,  the  overhead  of  either 
coordinated  or  uncoordinated  checkpointing  is  essentially  the  same.  Therefore,  uncoordinated  checkpointing  is  not 
likely  to  be  an  attractive  technique  in  practice  given  the  negligible  performance  gains.  These  gains  do  not  justify  the 
complexities  of  finding  a  consistent  recovery  line  after  the  failure,  the  susceptibility  to  the  domino  effect,  the  high 
storage  overhead  of  saving  multiple  checkpoints  of  each  process,  and  the  overhead  of  garbage  collection.  It  follows 
that coordinated checkpointing is superior to uncoordinated checkpointing when all aspects are considered on
balance.

A recent study has also shed some light on the behavior of communication-induced checkpointing [2]. It
presents an analysis of these protocols based on a prototype implementation and validated simulations, showing that
communication-induced checkpointing does not scale well as the number of processes increases. The occurrence of
forced checkpoints at random points within the execution due to communication messages makes it very difficult to
predict the required amount of stable storage for a particular application run. Also, this unpredictability affects the
policy for placing local checkpoints and makes CIC protocols cumbersome to use in practice. Furthermore, the
study shows that the benefit of autonomy in allowing processes to take local checkpoints at their convenience does
not seem to hold. In all experiments, a process takes at least twice as many forced checkpoints as local, autonomous
ones.

5.4  Communication  Protocols 

Rollback  recovery  complicates  the  implementation  of  protocols  used  for  inter-process  communications.  Some 
protocols  offer  the  abstraction  of  reliable  communication  channels  such  as  connection-based  protocols  (e.g.,  TCP, 
RPC).  Alternatively,  other  protocols  offer  the  abstraction  of  an  unreliable  datagram  service  (e.g.,  UDP).  Each  type 
of  abstraction  requires  additional  support  to  operate  properly  across  failures  and  recoveries. 

5.4.1  Location-Independent  Identities  and  Redirection 

For  all  communication  protocols,  a  rollback-recovery  system  must  mask  the  actual  identity  and  location  of  processes 
or  remote  ports  from  the  application  program.  This  masking  is  necessary  to  prevent  any  application  program  from 
acquiring  a  dependency  on  the  location  of  a  certain  process,  making  it  impossible  to  restart  the  process  on  a  different 
machine  after  a  failure.  A  solution  to  this  problem  is  to  assign  a  logical,  location-independent  identifier  to  each 
process  in  the  system.  This  scheme  also  allows  the  system  to  redirect  communication  appropriately  to  a  restarting 
process after a failure [18].
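
As a minimal sketch of location-independent identities, assume a small name service that maps each logical identifier to the current physical address of the process. The class `NameService`, the function `send`, and the host names are hypothetical and serve only to illustrate the redirection idea; they are not taken from any system cited here.

```python
class NameService:
    """Hypothetical registry: logical process identifier -> current address."""

    def __init__(self):
        self._location = {}

    def register(self, logical_id, address):
        # Called when a process starts and again each time it restarts.
        self._location[logical_id] = address

    def resolve(self, logical_id):
        return self._location[logical_id]


def send(names, outbox, dest_id, payload):
    # The application names only the logical identifier. The physical
    # address is looked up at send time, so a process that restarted on
    # another machine is reached without the application noticing.
    outbox.append((names.resolve(dest_id), payload))


names = NameService()
outbox = []
names.register("p1", ("hostA", 5000))
send(names, outbox, "p1", "m1")
names.register("p1", ("hostB", 5000))  # p1 restarted on a different machine
send(names, outbox, "p1", "m2")
```

Because the application holds only the string "p1", it never acquires a dependency on hostA, and the second message is redirected to hostB.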

5.4.2  Reliable  Channel  Protocols 

After  a  failure,  identity  masking  and  communication  redirection  are  sufficient  for  communication  protocols  that 
offer  the  abstraction  of  an  unreliable  channel.  Protocols  that  offer  the  abstraction  of  reliable  channels  require 
additional  support.  These  protocols  usually  generate  a  timeout  upcall  to  the  application  program  if  the  process  at  the 
other  end  of  the  channel  has  failed.  These  timeouts  should  be  masked  since  the  failed  program  will  soon  restart  and 
resume  computation.  If  such  upcalls  are  allowed  to  affect  the  application,  then  the  abstraction  of  a  reliable  system  is 
no  longer  upheld.  The  application  will  have  to  encode  the  necessary  support  to  communicate  with  the  failed  process 
after  it  recovers. 

Masking  timeouts  should  also  be  coupled  with  the  ability  of  the  protocol  implementation  to  reestablish  the 
connection  with  the  restarting  process  (possibly  restarting  on  a  different  machine).  This  support  includes  the  ability 
to  clean  up  the  old  connection  in  an  orderly  manner,  and  to  establish  a  new  connection  with  the  restarting  host. 
Furthermore,  messages  retransmitted  as  part  of  the  execution  replay  of  the  remote  host  must  be  identified  and,  if 
necessary,  suppressed.  This  requires  the  protocol  implementation  to  include  a  form  of  sequence  number  that  is  only 
used  for  this  purpose. 
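
The duplicate suppression described above can be sketched as follows, assuming FIFO channels and a per-sender sequence number reserved for recovery. The class `ReceiverEndpoint` and its fields are hypothetical names chosen for illustration.

```python
class ReceiverEndpoint:
    """Hypothetical receiver side of a channel with recovery sequence numbers."""

    def __init__(self):
        self.next_expected = {}  # sender id -> next sequence number to accept
        self.delivered = []      # payloads handed to the application

    def receive(self, sender, seq, payload):
        expected = self.next_expected.get(sender, 0)
        if seq < expected:
            # Already delivered before the sender failed; this copy was
            # regenerated by the sender's execution replay, so drop it.
            return False
        self.next_expected[sender] = seq + 1
        self.delivered.append(payload)
        return True


endpoint = ReceiverEndpoint()
for seq, payload in enumerate(["a", "b", "c"]):
    endpoint.receive("p", seq, payload)

# The sender fails, rolls back, and resends messages 1 and 2 during replay
# before producing the new message 3.
replayed = [endpoint.receive("p", 1, "b"), endpoint.receive("p", 2, "c")]
endpoint.receive("p", 3, "d")
```

The sequence number here is separate from any number the transport itself uses for acknowledgment, in line with the requirement that it serve only this purpose.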

Recovering  in-transit  messages  that  are  lost  because  of  a  failure  is  another  problem  for  reliable 
communication protocols. In TCP/IP-style communication, for instance, a message is considered delivered once an
acknowledgment  is  received  from  the  remote  host.  The  message  itself  may  linger  in  the  kernel’s  buffer  for  a  while 
before  the  receiving  process  consumes  it.  If  this  process  fails,  the  in-transit  messages  must  be  resent  to  preserve  the 
semantics  of  the  reliable  communication  channel.  Messages  must  be  saved  at  the  sender  side  for  possible 
retransmission during recovery. In a system that performs sender-based message logging, this step can be folded into
log maintenance. In other systems that do not log messages or log messages at the receiver, the
copying  of  each  message  at  the  sender  side  introduces  overhead  and  complexity.  The  complexity  is  due  to  the  need 
for executing some garbage collection algorithm with other sites to reclaim the volatile storage.
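
A minimal sketch of sender-side retention, under the assumption that the receiver reports at each checkpoint how many messages from the sender that checkpoint already covers. The class `SenderLog` and its method names are hypothetical.

```python
class SenderLog:
    """Hypothetical volatile log of sent messages, kept for retransmission."""

    def __init__(self):
        self.log = {}       # destination -> list of (seq, payload)
        self.next_seq = {}  # destination -> next sequence number to assign

    def send(self, dest, payload):
        seq = self.next_seq.get(dest, 0)
        self.next_seq[dest] = seq + 1
        # Keep a copy even after the transport acknowledges the message:
        # the receiver may fail before its process consumes it.
        self.log.setdefault(dest, []).append((seq, payload))
        return seq

    def resend_from(self, dest, first_needed):
        # During the receiver's recovery, resend what its restored state lacks.
        return [(s, p) for s, p in self.log.get(dest, []) if s >= first_needed]

    def garbage_collect(self, dest, covered_upto):
        # The receiver's latest checkpoint covers every message with a
        # sequence number below covered_upto, so those copies are reclaimed.
        self.log[dest] = [(s, p) for s, p in self.log.get(dest, [])
                          if s >= covered_upto]


log = SenderLog()
for payload in ("a", "b", "c"):
    log.send("q", payload)
log.garbage_collect("q", 1)          # q checkpointed after consuming message 0
to_resend = log.resend_from("q", 1)  # q fails and restarts from that checkpoint
```

The call to `garbage_collect` stands in for the coordination with other sites mentioned above; without it the volatile copies would accumulate indefinitely.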

5.5  Log-based  Recovery 

5.5.1  Message  Logging  Overhead 

Message  logging  introduces  three  sources  of  overhead.  First,  each  message  must  in  general  be  copied  to  the  local 
memory of the process. Second, the volatile log is regularly flushed to stable storage to free up space. Third,
message  logging  nearly  doubles  the  communication  bandwidth  required  to  run  the  application  for  systems  that 
implement  stable  storage  via  a  highly-available  file  system  accessible  through  the  network.  The  first  source  of 
overhead  may  directly  affect  communication  throughput  and  latency.  This  is  especially  true  if  the  copying  occurs  in 
the  critical  path  of  the  inter-process  communication  protocol.  In  this  respect,  sender-based  logging  is  considered 
more efficient than receiver-based logging because the copying can take place after sending the message over the
network.  Additionally,  the  system  may  combine  the  logging  of  messages  with  the  implementation  of  the 
communication  protocol  and  share  the  message  log  with  the  transmission  buffers.  This  scheme  avoids  the  extra 
copying  of  the  message.  Logging  at  the  receiver  is  more  expensive  because  it  is  in  the  critical  path  of  the 
communication  protocol. 

Another optimization for sender-based logging systems is to use copy-on-write to avoid making extra
copies [21]. This scheme works well in systems where broadcast messages are implemented using several
point-to-point messages. In this case, copy-on-write allows the system to keep one copy for identical messages and
thus  reduce  the  storage  and  performance  overhead  of  logging.  No  similar  optimization  can  be  performed  in 
receiver-based  systems  [21]. 

5.5.2  Combining  Log-Based  Recovery  with  Coordinated  Checkpointing 

Log-based  recovery  has  been  traditionally  presented  as  a  mechanism  to  allow  the  use  of  uncoordinated 
checkpointing  with  no  domino  effect.  But  a  system  may  also  combine  event  logging  with  coordinated 
checkpointing, yielding several benefits with respect to performance and simplicity [21]. These benefits include
those of coordinated checkpointing (such as the simplicity of recovery and garbage collection) and those of
log-based recovery (such as fast output commit). Most prominently, this combination obviates the need for flushing the
volatile message logs to stable storage in a sender-based logging implementation. Thus, there is no need for
maintaining large logs on stable storage, resulting in lower performance overhead and simpler implementations. The
combination  of  coordinated  checkpointing  and  message  logging  has  been  shown  to  outperform  one  with 
uncoordinated  checkpointing  and  message  logging  [21].  Therefore,  the  purpose  of  logging  should  no  longer  be  to 
allow uncoordinated checkpointing. Rather, it should be to enable fast output commit for those
applications that need this feature.

5.6  Stable  Storage 

Magnetic disks have been the medium of choice for implementing stable storage [35]. Although they are slow, their
combination of storage capacity and low cost cannot be matched by the alternatives. An implementation of a
stable  storage  abstraction  on  top  of  a  conventional  file  system  may  not  be  the  best  choice,  however.  Such  an 
implementation  will  not  generally  give  the  performance  or  reliability  needed  to  implement  stable  storage 
[6][18][49]. Modern file systems tend to be optimized for the pattern of access expected in workstation or personal
computing  environments.  Furthermore,  these  file  systems  are  often  accessed  through  a  network  via  a  protocol  that  is 
optimized  for  small  file  accesses  and  not  for  the  large  file  accesses  that  are  more  common  in  checkpointing  and 
logging. 

An  implementation  of  stable  storage  should  bypass  the  file  system  layer  and  access  the  disk  directly.  One 
such  implementation  is  the  KitLog  package,  which  offers  a  log  abstraction  that  can  support  checkpointing  and 
message  logging  [49].  The  package  runs  in  conventional  UNIX  systems  and  bypasses  the  UNIX  file  system  by 
accessing the disk in raw mode. There have also been several attempts at implementing stable storage using
non-volatile semiconductor memory [6]. Such implementations do not have the performance problems associated with
disks.  The  price  and  the  small  storage  capacity  remain  two  problems  that  limit  their  wide  acceptance. 

5.7  Support  for  Nondeterminism 

Nondeterminism  occurs  when  the  application  program  interacts  with  the  operating  system  through  system  calls  and 
upcalls.  The  PWD  assumption  inherent  in  log-based  recovery  systems  stipulates  that  these  nondeterministic  events 
be  logged  on  stable  storage  so  that  they  can  be  replayed  during  recovery.  Log-based  recovery  systems  differ  in  the 
range  of  actual  events  that  can  be  covered  under  the  PWD  assumption.  There  are  two  main  sources  of 
nondeterminism  in  actual  systems,  namely  system  calls  and  asynchronous  events. 

5.7.1  System  Calls 

System calls in general can be classified into three types [11][18][23]. Idempotent system calls are those that return
deterministic  values  whenever  executed.  Examples  include  calls  that  return  the  user  identifier  of  the  process  owner. 
These  calls  do  not  need  to  be  logged.  A  second  class  of  calls  consists  of  those  that  must  be  logged  during  failure- 
free  operation  but  should  not  be  re-executed  during  execution  replay.  The  result  from  these  calls  should  simply  be 
replayed  to  the  application  program.  These  calls  include  those  that  inquire  about  the  environment,  such  as  getting 
the current time of day. Re-executing these calls during recovery might return a different value that is inconsistent
with the pre-failure execution. The last type consists of system calls that must be logged during failure-free
operation  and  re-executed  during  execution  replay.  These  calls  generally  modify  the  environment  and  therefore  they 
must  be  re-executed  to  re-establish  the  environment  changes.  Examples  include  calls  that  allocate  memory  or  create 
processes.  Ensuring  that  these  calls  return  the  same  values  and  generate  the  same  effect  during  re-execution  can  be 
very  complex. 
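
The three types of system calls can be illustrated with a small interception layer. This is only a sketch under simplifying assumptions: the class `SyscallLog` is hypothetical, a constant stands in for the user-identifier query, and appending to a list stands in for process creation.

```python
import time


class SyscallLog:
    """Hypothetical interception layer for the three types of system calls."""

    def __init__(self, log=None):
        self.replaying = log is not None
        self.log = [] if log is None else list(log)
        self.cursor = 0

    def _next_logged(self):
        value = self.log[self.cursor]
        self.cursor += 1
        return value

    def idempotent(self, call):
        # Type 1: returns the same value on every execution; nothing is logged.
        return call()

    def replay_only(self, call):
        # Type 2: logged during failure-free operation, never re-executed.
        if self.replaying:
            return self._next_logged()
        value = call()
        self.log.append(value)
        return value

    def reexecute(self, call):
        # Type 3: re-executed to re-establish its effect on the environment;
        # the result is checked against the logged one.
        value = call()
        if self.replaying:
            if value != self._next_logged():
                raise RuntimeError("replay diverged from the logged result")
        else:
            self.log.append(value)
        return value


def application(calls, environment):
    owner = calls.idempotent(lambda: "alice")  # stand-in for a user-id query
    now = calls.replay_only(time.time)         # inquiry about the environment

    def create_child():                        # stand-in for process creation
        environment.append("child")
        return len(environment)

    child = calls.reexecute(create_child)
    return owner, now, child


env_before, env_after = [], []
normal = SyscallLog()
result_before = application(normal, env_before)
recovery = SyscallLog(log=normal.log)
result_after = application(recovery, env_after)
```

During replay the time of day comes from the log rather than from the clock, while the stand-in for process creation runs again so that the recovered environment contains the child.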

5.7.2  Asynchronous  Signals 

Nondeterminism also results from asynchronous signals available in the form of software interrupts under various
operating  systems.  Such  signals  must  be  applied  at  the  same  execution  points  during  replay  to  reproduce  the  same 
result.  Log-based  rollback  recovery  can  cover  this  form  of  nondeterminism  by  taking  a  checkpoint  after  the 
occurrence  of  each  signal,  which  can  be  very  expensive  [7].  Alternatively,  the  system  may  convert  these 
asynchronous  signals  to  synchronous  messages  such  as  in  Targon/32  [11],  or  it  may  queue  the  signals  until  the 
application  polls  for  them.  Both  alternatives  convert  asynchronous  event  notifications  into  synchronous  ones,  which 
may  not  be  suitable  or  efficient  for  many  applications.  Such  solutions  also  require  substantial  modifications  to  the 
operating  system  or  the  application  program. 

Another example of nondeterminism that is difficult to track is shared memory manipulation in
multi-threaded applications. Reconstructing the same execution during replay requires the same interleaving of shared
memory  accesses  by  the  various  threads  as  in  the  pre-failure  execution.  Systems  that  support  this  form  of 
nondeterminism  supply  their  own  sets  of  locking  primitives,  and  require  applications  to  use  them  for  protecting 
access  to  shared  memory  [23].  The  primitives  are  instrumented  to  insert  an  entry  in  the  log  identifying  the  calling 
thread  and  the  nature  of  the  synchronization  operation.  However,  this  technique  has  several  problems.  It  makes 
shared  memory  access  expensive,  and  may  generate  a  large  volume  of  data  in  the  log.  Furthermore,  if  the 
application  does  not  adhere  to  the  synchronization  model  (because  of  a  programmer’s  error,  for  instance),  execution 
replay  may  not  be  possible. 
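
A minimal sketch of an instrumented locking primitive, assuming that every access to shared memory is protected by it. The class `LoggedLock` is hypothetical; it logs the name of each acquiring thread and, when given a log, admits threads only in the logged order.

```python
import threading


class LoggedLock:
    """Hypothetical lock that records, or replays, the acquisition order."""

    def __init__(self, replay_order=None):
        self._lock = threading.Lock()
        self._turn = threading.Condition()
        self.order = []  # log entry per acquisition: the calling thread's name
        self._replay = None if replay_order is None else list(replay_order)

    def acquire(self, thread_name):
        if self._replay is not None:
            # Replay: block until the log says this thread acquires next.
            with self._turn:
                self._turn.wait_for(
                    lambda: len(self.order) < len(self._replay)
                    and self._replay[len(self.order)] == thread_name)
        self._lock.acquire()
        with self._turn:
            self.order.append(thread_name)
            self._turn.notify_all()

    def release(self):
        self._lock.release()


def run(lock):
    shared = []

    def worker(name):
        for _ in range(2):
            lock.acquire(name)
            shared.append(name)  # the protected shared memory access
            lock.release()

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


recorded = LoggedLock()
first = run(recorded)
replayed = run(LoggedLock(replay_order=recorded.order))
forced = run(LoggedLock(replay_order=["t2", "t1", "t2", "t1"]))
```

If a thread touched `shared` without going through the lock, the log would no longer determine the interleaving, which is the failure mode noted above for applications that do not adhere to the synchronization model.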

A  technique  for  tracking  nondeterminism  due  to  asynchronous  signals  and  interleaved  memory  access  on 
single  processor  systems  is  to  use  instruction  counters  [13].  An  instruction  counter  is  a  register  that  decrements  by 
one  upon  the  execution  of  each  instruction,  leading  the  hardware  to  generate  an  exception  when  the  register  contents 
become  0.  An  instruction  counter  can  thus  be  used  in  two  modes.  In  one  mode,  the  register  is  loaded  with  the 
number  of  instructions  to  be  executed  before  a  breakpoint  occurs.  After  the  CPU  executes  the  specified  number  of 
instructions,  the  counter  reaches  0  and  the  hardware  generates  an  exception.  The  operating  system  fields  the 
exception  and  executes  a  pre-specified  handler.  This  mode  is  useful  in  setting  breakpoints  efficiently,  such  as  during 
debugging.  In  the  second  mode,  the  instruction  counter  is  loaded  with  the  maximum  value  it  can  hold.  Execution 
proceeds  until  an  event  of  interest  occurs,  at  which  time  the  content  of  the  counter  is  sampled,  and  the  number  of 
instructions  executed  since  the  time  the  counter  was  set  is  computed  and  logged.  The  use  of  instruction  counters  has 
been  suggested  for  debugging  shared  memory  parallel  programs  [37]. 

Instruction  counters  can  be  used  in  rollback  recovery  to  track  the  number  of  instructions  that  occur  between 
asynchronous  interrupts  [54].  These  instruction  counts  are  logged  as  part  of  the  log  that  describes  the 
nondeterministic  events.  During  recovery,  the  system  recovers  the  instruction  counts  from  the  log  and  uses  them  to 
regenerate the software interrupts at the same execution points within the application as before the failure. The
application thus goes through precisely the same sequence of asynchronous events as it did before the failure, and
therefore it can reconstruct its execution.
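
The two modes of an instruction counter can be mimicked with a toy interpreter. The sketch below is an illustration only: it assumes a single asynchronous signal, models each instruction as one loop iteration, and uses the hypothetical names `run` and `MAX_COUNT`.

```python
MAX_COUNT = 2**32 - 1  # largest value of the hypothetical counter register


def run(num_instructions, signal_before=None, logged_count=None):
    """Execute a toy program; return its trace and the logged instruction count.

    Failure-free mode (logged_count is None): the counter is loaded with its
    maximum value and sampled when the signal arrives.
    Replay mode: the counter is loaded with the logged count and the signal
    is regenerated when the counter reaches 0.
    """
    replay = logged_count is not None
    counter = logged_count if replay else MAX_COUNT
    armed = replay
    trace, log = [], None
    for pc in range(num_instructions):
        if replay and armed and counter == 0:
            trace.append("SIG")        # exception at 0: deliver the signal
            armed = False
        elif not replay and pc == signal_before:
            log = MAX_COUNT - counter  # instructions executed so far
            trace.append("SIG")
        trace.append(pc)               # "execute" one instruction
        counter -= 1
    return trace, log


original, count = run(6, signal_before=4)
replayed, _ = run(6, logged_count=count)
```

The logged count is the only information that must reach the log for this signal; replay needs neither a checkpoint at the signal nor any change to the program.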

An  instruction  counter  can  be  implemented  in  hardware,  as  in  the  PA-RISC  precision  architecture  where  it 
has  been  used  to  augment  the  hypervisor  of  a  virtual-machine  manager  and  coordinate  a  primary  virtual  machine 
with  its  backup  [13].  It  also  can  be  emulated  in  software  [37].  An  implementation  study  shows  that  the  overhead  of 
program  instrumentation  and  tracking  nondeterminism  is  less  than  6%  for  a  variety  of  user  programs  and  synthetic 
benchmarks  [54]. 

5.8  Dependency  Tracking 

Rollback-recovery  protocols  vary  in  the  ways  they  track  inter-process  dependencies.  Some  protocols  require 
tagging only an index or a sequence number on every application message [14], while some require the propagation
of  a  vector  of  timestamps  [56].  At  an  extreme,  some  protocols  require  the  propagation  of  a  graph  describing  the 
history  of  the  computation  [18],  or  matrices  containing  bit  or  timestamp  vectors  [5]. 

Tagging an application message with an index or a sequence number is simple and does not
cause  any  measurable  overhead.  When  dependency  tracking,  however,  requires  more  complex  structures,  there  are 
techniques  for  reducing  the  amount  of  actual  data  that  need  to  be  transferred  on  top  of  each  message.  All  these 
techniques  revolve  around  two  themes.  First,  only  incremental  changes  need  to  be  sent.  If  some  elements  of  a 
vector or a graph have not changed since process p last sent a message to process q, then p need only include those
elements  that  have  changed.  Implementation  of  this  optimization  is  straightforward  in  systems  that  assume  FIFO 
communication  channels.  When  lossy  channels  are  assumed,  this  optimization  is  still  possible,  but  at  the  expense  of 
more  processing  overhead  [18]. 
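
The incremental technique can be sketched for a dependency vector, assuming FIFO channels so that the receiver sees every increment in order. The class `Process` and its method names are hypothetical.

```python
class Process:
    """Hypothetical process that piggybacks only the changed vector entries."""

    def __init__(self, pid, n):
        self.pid = pid
        self.vector = [0] * n
        self.last_sent = {}  # destination pid -> vector as of the last send

    def local_event(self):
        self.vector[self.pid] += 1

    def piggyback_for(self, dest):
        previous = self.last_sent.get(dest, [0] * len(self.vector))
        # Include only the entries that changed since the last message to dest.
        delta = {i: v for i, v in enumerate(self.vector) if v != previous[i]}
        self.last_sent[dest] = list(self.vector)
        return delta

    def merge(self, delta):
        for i, v in delta.items():
            self.vector[i] = max(self.vector[i], v)


p0, p1, p2 = Process(0, 3), Process(1, 3), Process(2, 3)
p2.local_event()
p0.merge(p2.piggyback_for(0))      # p0 learns of p2's event
p0.local_event()
first_delta = p0.piggyback_for(1)  # first message from p0 to p1
p1.merge(first_delta)
p0.local_event()
second_delta = p0.piggyback_for(1) # only p0's own entry changed
p1.merge(second_delta)
```

With FIFO delivery the merged result equals what a full vector would have produced, because any entry omitted from an increment was already delivered by an earlier message.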

The  other  technique  for  reducing  the  overhead  of  dependency  tracking  exploits  the  semantics  of  the 
applications  and  the  communication  patterns  [18].  For  instance,  if  it  can  be  inferred  from  the  dependency 
information  available  to  process  p  that  process  q  already  knows  parts  of  the  information  that  is  to  be  piggybacked  on 
a  message  outgoing  to  q,  then  process  p  can  exclude  this  information.  Surprisingly,  implementing  this  optimization 
is  simple  and  yields  good  performance  [18].  Regardless  of  the  particular  method  used  to  track  inter-process 
dependencies,  various  prototype  implementations  have  shown  that  the  overhead  resulting  from  tracking  is  negligible 
compared to the overhead of checkpointing or logging [1][2][9][11][18][23][28][48][52].

5.9  Recovery 

Handling  execution  restart  and  replay  is  a  difficult  part  of  implementing  a  rollback-recovery  system  [11].  The  major 
challenge  is  reintegrating  the  recovered  process  in  a  computation  environment  that  may  or  may  not  be  the  same  as 
the  one  in  which  the  process  was  executing  before  failure. 

5.9.1  Reinstating  a  Process  in  its  Environment 

Implanting a process in a different environment during recovery can create difficulties if its state depends on the
pre-failure environment. For example, the process may need to access files that exist on the local disk of the machine.
The  simplest  solution  to  this  problem  is  to  attempt  to  restart  the  program  on  the  same  host.  If  this  is  not  feasible, 
then  the  system  must  insulate  the  process  from  environment-specific  variables  [18].  This  can  be  done  for  instance 
by  intercepting  system  calls  that  return  environment-specific  results  and  replacing  them  with  abstract  values  under 
the control of the recovery system. Also, file access could be made highly available by placing all files in
network-wide highly available file servers or by using dual-ported disks.

Another  problem  in  implementing  recovery  is  the  need  to  reconstruct  the  auxiliary  state  within  the  operating 
system  kernel  that  supports  the  recovering  process  [18][26][28][43].  This  state  is  usually  outside  of  the  recovery 
protocol’s  control  during  failure-free  operation,  unless  the  protocol  is  implemented  inside  the  operating  system.  For 
protocols  implemented  outside  the  operating  system,  the  rollback-recovery  system  must  emulate  these  data 
structures and log sufficient information to be able to recreate them during recovery. For example, the recovery
system may imitate the open file table of a particular process by intercepting all file manipulation calls from the
process itself. Then, the recovery system records some information that would enable it to issue requests to the
operating system during recovery in order to force the operating system to recreate these data structures indirectly.
Obviously, not all state within the operating system kernel can be emulated this way, and therefore out-of-kernel
implementations will have to restrict the portion of the system’s state that needs to be emulated. Since most of the
applications  that  benefit  from  rollback  recovery  seem  to  be  in  the  realm  of  scientific  computing  where  no 
sophisticated  state  is  maintained  by  the  kernel  on  behalf  of  the  processes,  this  problem  does  not  seem  to  be  severe  in 
that  particular  context  [44]. 
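
The emulation of an open file table can be sketched as follows, assuming read-only files and an application that performs all file access through the interception layer. The class `FileTableShadow` and its methods are hypothetical.

```python
import os
import tempfile


class FileTableShadow:
    """Hypothetical user-level shadow of a process's open file table."""

    def __init__(self):
        self.entries = {}  # logical descriptor -> [path, offset]
        self.handles = {}  # logical descriptor -> live file object
        self._next = 0

    def open(self, path):
        fd = self._next
        self._next += 1
        self.handles[fd] = open(path, "rb")
        self.entries[fd] = [path, 0]
        return fd

    def read(self, fd, n):
        data = self.handles[fd].read(n)
        self.entries[fd][1] += len(data)  # track the offset the kernel holds
        return data

    def close_all(self):
        for handle in self.handles.values():
            handle.close()

    def checkpoint(self):
        # Information saved with the checkpoint: enough to reissue the calls.
        return {fd: tuple(entry) for fd, entry in self.entries.items()}

    @classmethod
    def recover(cls, saved):
        table = cls()
        for fd, (path, offset) in saved.items():
            handle = open(path, "rb")  # ask the OS to recreate the entry
            handle.seek(offset)
            table.handles[fd] = handle
            table.entries[fd] = [path, offset]
        table._next = max(saved, default=-1) + 1
        return table


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "data.bin")
    with open(path, "wb") as f:
        f.write(b"abcdefgh")
    table = FileTableShadow()
    fd = table.open(path)
    first = table.read(fd, 3)
    saved = table.checkpoint()
    table.close_all()                  # the process fails here
    recovered = FileTableShadow.recover(saved)
    rest = recovered.read(fd, 5)       # same logical descriptor still works
    recovered.close_all()
```

The kernel's own table is never inspected; it is rebuilt indirectly by reissuing `open` and `seek` requests, which is why only state reachable through such requests can be covered this way.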

5.9.2  Behavior  During  Recovery 

Previous  studies  have  outlined  several  characteristics  of  rollback-recovery  systems  during  recovery  [18][48].  For 
example,  it  has  been  observed  that  for  log-based  recovery  systems,  the  messages  and  determinants  available  in  the 
logs  are  replayed  at  a  considerably  higher  speed  during  recovery  than  during  normal  execution.  This  is  because 
during  normal  execution  a  process  may  have  to  block  waiting  for  messages  or  synchronization  events,  while  during 
recovery  these  messages  or  events  can  be  immediately  replayed. 

Also,  it  has  been  observed  that  sender-based  logging  protocols  typically  slow  down  recovery  if  there  are 
multiple  failures,  because  of  the  need  to  re-execute  some  of  the  processes  under  control  to  regenerate  the  messages. 
Moreover,  some  of  these  protocols  may  require  sympathetic  rollbacks  [56],  which  increase  the  overhead  of 
synchronizing  the  processes  during  recovery. 

This experimental evidence seems to confirm the tradeoff between protocols that perform well during
failure-free executions, such as causal and optimistic logging, and protocols that perform well during recovery, such as
pessimistic logging [48]. It is possible to address this tradeoff by performing logging at both the sender and the
receiver [56], such that the sender log is volatile and is kept only until the receiver flushes its volatile log to stable
storage. 

5.10  Rollback  Recovery  in  Practice 

Despite  the  wealth  of  research  in  the  area  of  rollback  recovery  in  distributed  systems,  very  few  commercial  systems 
have actually adopted rollback-recovery protocols. Difficulties in implementing recovery are perhaps the main reason why these protocols
have not been widely adopted. Additionally, the range of applications that benefit from these protocols tends to be in
the  realm  of  long-running,  scientific  programs,  which  are  relatively  few.  Many  of  these,  in  fact,  are  written  to  run 
on  supercomputers  where  some  facility  exists  for  checkpointing  the  entire  system’s  state.  For  the  few  that  run  in  a 
distributed  system,  public  domain  libraries  that  implement  checkpointing  have  proved  adequate  [44]. 

Log-based recovery seems to have had less success than checkpoint-only systems. A commercial
implementation  of  pessimistic  logging  did  not  fare  well,  although  the  reasons  are  not  clear  [11].  One  could 
conjecture  that  the  complex  modifications  made  to  the  operating  system  and  the  special-purpose  hardware  that  was 
used  to  mitigate  performance  overhead  made  the  machine  expensive.  Some  other  usage  of  log-based  recovery  has 
been  reported  in  telecommunication  applications  [26],  although  there  are  no  reports  on  how  they  fared. 
Interestingly,  both  commercial  implementations  used  pessimistic  logging,  and  were  used  for  applications  where  the 
performance  overhead  of  this  form  of  logging  could  be  tolerated.  We  are  not  aware,  however,  of  any  use  of 
optimistic  or  causal  logging  rollback-recovery  protocols  in  commercial  systems. 

6  Concluding  Remarks 

We  have  reviewed  and  compared  different  approaches  to  rollback  recovery  with  respect  to  a  set  of  properties 
including  the  assumption  of  piecewise  determinism,  performance  overhead,  storage  overhead,  ease  of  output 
commit,  ease  of  garbage  collection,  ease  of  recovery,  freedom  from  domino  effect,  freedom  from  orphan  processes, 
and  the  extent  of  rollback.  These  approaches  fall  into  two  broad  categories:  checkpointing  protocols  and  log-based 
recovery  protocols. 

Checkpointing  protocols  require  the  processes  to  take  periodic  checkpoints  with  varying  degrees  of 
coordination.  At  one  end  of  the  spectrum,  coordinated  checkpointing  requires  the  processes  to  coordinate  their 
checkpoints  to  form  global  consistent  system  states.  Coordinated  checkpointing  generally  simplifies  recovery  and 
garbage  collection,  and  yields  good  performance  in  practice.  At  the  other  end  of  the  spectrum,  uncoordinated 
checkpointing  does  not  require  the  processes  to  coordinate  their  checkpoints,  but  it  suffers  from  potential  domino 
effect,  complicates  recovery,  and  still  requires  coordination  to  perform  output  commit  or  garbage  collection. 
Between  these  two  ends  are  communication-induced  checkpointing  schemes  that  depend  on  the  communication 
patterns  of  the  applications  to  trigger  checkpoints.  These  schemes  do  not  suffer  from  the  domino  effect  and  do  not 
require  coordination.  Recent  studies,  however,  have  shown  that  the  nondeterministic  nature  of  these  protocols 
complicates  garbage  collection  and  degrades  performance. 

Log-based rollback recovery based on the PWD assumption is often a natural choice for applications that
frequently  interact  with  the  outside  world.  Log-based  recovery  allows  efficient  output  commit,  and  has  three 
flavors: pessimistic, optimistic, and causal. The simplicity of pessimistic logging makes it attractive for practical
applications if a high failure-free overhead can be tolerated. This form of logging simplifies recovery and output
commit, and protects surviving processes from having to roll back. These advantages have made pessimistic logging
attractive in commercial environments where simplicity and robustness are necessary. Causal logging can be
employed  to  reduce  the  overhead  while  still  preserving  the  properties  of  fast  output  commit  and  orphan-free 
recovery.  Alternatively,  optimistic  logging  reduces  the  overhead  further  at  the  expense  of  complicating  recovery 
and  increasing  the  extent  of  rollback  upon  a  failure. 